PREFACE


Fifty years ago a large section of the general public were not
only uninterested in what we now call the social problem, but they
scarcely gave a thought to the existence of such a problem. They
felt vaguely perhaps, during periods of acute distress due to lack of
employment, that all was not well and they thought the Government
or possibly the big landowner was to blame, but only the
more enlightened realized the complexity of the body politic and
how fearfully and wonderfully it is made. To-day all this is changed,
and comparatively few imagine that a single panacea (the prohibition
of drink, the nationalization of land, or a levy on capital)
will cure all evils.

The very fact that nearly the whole civilized world has given 
itself up for over four years to the destruction of life and the dragging 
down of the social fabric in all countries on so vast a scale has 
led to a surfeit and a reaction in which thoughtful men are eager 
to take part in proclaiming again a common brotherhood and in
building a better world. Those who have always been interested 
in this kind of architecture welcome the change of spirit, but they 
also recognize the difficulty of the task undertaken and the need 
for no little mental effort to second the good-will, which is the first 
essential for success. To pull down no teacher is needed, but we
must learn to build.

This leads one to the subject of the present book. The man who 
wishes his work to stand must make sure of its foundations. He 
cannot afford to rest satisfied, as too often the politician and social 
worker do, with wild and ill-informed generalizations where more 
exact knowledge is possible, and there are few human problems in 
the discussion of which some acquaintance with the proper treatment
of statistics is not in the highest degree necessary.

Most people, however, are suspicious of figures. They imagine
that quantitative considerations must of necessity deaden all
feeling for the purely aesthetic or qualitative spirit which is the 
very life of the phenomena observed or measured. But this surely 
need not be the case. Kepler, when he succeeded in translating 
the motions of the planets into the language of number was not, we 
believe, the less but rather the more enamoured of the beauty and 
order with which the whole of creation is clothed. 

A second reason for suspicion is that partisans of one school or 
another with more push than principle sometimes trade upon the 
general ignorance of statistics to ‘ prove ’ their own pet theories, 
while others no less enthusiastic lead the credulous public into the 
ditch, not with malice intent, but because they are really blind 
themselves to the right interpretation of the figures they so glibly 
quote. 

Although a concern in social questions led the present writer in 
the first instance to study the theory of statistics, there is no reason 
why this bias should prevent the book being of service to those who 
wish to know something of its application in other directions, seeing 
that the general principles underlying the theory are the same in
all cases, and illustrations have been taken from any field, biological, 
economic, medical, etc., just as they suited the immediate purpose 
in view. 

The author makes no claim to any originality : he is no more 
than a student seeking to put together, with some kind of system 
and as he understands them, the simpler and more important ideas 
he has gathered from other sources. The matter is entirely the
work of others, the manner only is his own, and he will be happy
to receive criticism if thereby he may learn more. His chief qualification
for writing is that he has had to worry through most of
his difficulties alone, and consequently he knows where another 
student is likely to be in trouble better perhaps than the kind of 
writer who is so quick as to be able to see through things at a glance 
or, failing that, so fortunate as to be able to borrow immediate 
light from others. 

The book is divided into two parts. Practically all the first part 
should be well within the understanding of the ordinary person.
Part II is more mathematical, but an effort has been made throughout
to explain results in such a way that the reader shall gain a
general idea of the theory and be able to apply it without needing 
to master all the actual proofs. The whole is meant, not as an 
exhaustive treatise, but merely as a first course introducing the 
reader to more serious works, and, since real inspiration is to be 
found nowhere so surely as at the source, it is intended to encourage 
and fit him to pursue the subject further by consulting at least the 
most important original papers referred to in the text, only enough 
references being given to awaken curiosity. With the same intention
a short chapter is inserted after the Appendix by way of suggesting
a few of the sources of statistics likely to be of interest to
the social student. 

Some living writers, notably Professor Karl Pearson, have 
contributed so largely to the development and application of 
statistics that it is impossible to write upon the subject at all without 
incorporating large parts of their work, and the least one can do 
is gladly to record the benefit and pleasure one has received from 
them. The author’s indebtedness to the two most important 
English text-books, Yule’s Theory of Statistics and Bowley’s
Elements of Statistics, will be evident also to any one who knows
these books, for they became so familiar through constant study 
that he fears he may have drawn upon them unconsciously even 
to the point of plagiarism in places. 

Finally, he wishes specially to acknowledge the kindness of four 
friends: Mr. Peter Fraser, Lecturer in Mathematics at Bristol University,
without whose encouragement in the early stages the work
would never have been attempted ; Professor H. T. H. Piaggio, 
University College, Nottingham, and Mr. A. W. Young, sometime 
Lecturer at the Sir John Cass Technical Institute, London, whose 
criticisms and suggestions were most valuable; and Professor 
W. P. Milne, of Leeds University, who, both as a practical teacher 
and as Editor of this series, ungrudgingly gave his help and advice.

D. C. J. 



NOTE TO THE SECOND EDITION 

The kindly reception given to my book leads me to think that 
it might appeal to a wider circle of readers if they were not 
frightened by the mathematical appearance of certain pages in the 
second part. With this new issue it has been made possible to 
obtain Part I separately.

A selection of examples from London B.Sc. (Econ.) papers has
been included by kind permission of the Authority concerned. It 
is hoped that these may prove useful to students. An Index has 
also been added. Otherwise no changes of importance have been 
made in the text. 

D. C. J.

September 1924.


PART I


CHAPTER I 

INTRODUCTORY 

Early Historical Beginnings. Statistics, more or less valuable,
have been compiled in most civilized countries from very early
times. One reason for doing this on a large scale has been to
ascertain the man-power and material strength of the nation for
military or fiscal purposes, and we read in the Old Testament of
such censuses being taken in the case of the Jews, while among the
Romans also it was a common practice.

In England, as economic terms began to be used and their meanings
analysed, and especially during the period when the mercantile
system prevailed, and the Government endeavoured so far as was
practicable to direct industry into channels such that it would add
most to the power of the realm, men tried frequently to base arguments
for social and political reform upon the results of figures
collected. A distinct advance had been made in the seventeenth
century when mortality tables were drawn up and discussed by Sir
William Petty and Halley, the famous astronomer, among others,
and their labours prepared the way for a more scientific treatment
of statistical methods, especially at the hands of one, Süssmilch, a
Prussian clergyman, who published an important work in 1761.

It is almost true to say, however, that until the time of the great
Belgian, Quetelet (1796-1874), no substantial theory of statistics
existed. The justice of this claim will be recognized when we
remark that it was he who really grasped the significance of one
of the fundamental principles (sometimes spoken of as the constancy
of great numbers) upon which the theory is based. A simple illustration
will explain the nature of this important idea : Imagine
100,000 Englishmen, all of the same age and living under the same
normal conditions, ruling out, that is, such abnormalities as are
occasioned by wars, famines, pestilence, etc. Let us divide these
men at random into ten groups, containing 10,000 each, and note
the age of every man when he dies. Quetelet’s principle lays
down that, although we cannot foretell how long any particular
individual will live, the ages at death of the 10,000 added together,
whichever group we consider, will be practically the same. Depending
upon this fact insurance companies calculate the premiums
they must charge, by a process of averaging mortality results recorded
in the past, and so they are able to carry on business without
serious risk of bankruptcy.
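The principle may be illustrated by a small simulation. The sketch below is not drawn from any mortality table: the ages at death are invented, taken from a bell-shaped spread about 70 years (an assumption made purely for illustration), and the name group_totals is our own.

```python
import random
import statistics


def group_totals(n_people=100_000, n_groups=10, seed=0):
    """Total the ages at death within each of n_groups random groups."""
    rng = random.Random(seed)
    # Hypothetical ages at death: bell-shaped about 70, kept within 0-105 years.
    ages = [min(105.0, max(0.0, rng.gauss(70.0, 12.0))) for _ in range(n_people)]
    rng.shuffle(ages)  # divide the men "at random"
    size = n_people // n_groups
    return [sum(ages[i * size:(i + 1) * size]) for i in range(n_groups)]


totals = group_totals()
# Gap between the largest and smallest group total, relative to the mean total.
spread = (max(totals) - min(totals)) / statistics.mean(totals)
print(f"relative spread of the ten group totals: {spread:.4%}")
```

On invented data of this kind the ten totals agree very closely with one another, although the individual ages differ by several decades.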

As a distinguished statistician once said, ‘ By the use of statistics 
we obtain from milliards of facts the grand average of the world.’ 
But if the average resulting from our observations were subject to 
violent fluctuation as we passed from one set of facts to another 
cognate set there would be little satisfaction in finding it. It is
the comparative constancy of the average, if the number of our 
observations is large enough, which makes it so important, as 
Quetelet observed, for although the idea was not altogether new he 
first realized how wide an application it had and how fruitful of 
practical results it might prove. 

Quetelet was born in Ghent, and taught mathematics in the
College there in his early youth. After graduating as Doctor of 
Science he became Professor of Mathematics in Brussels Athenaeum 
when only twenty-three years old, and later he was made Director 
of the Brussels Observatory, in the foundation of which he had 
taken a leading part. In 1841 he was appointed President of the 
Central Commission of Statistics, where he was in a position to
render valuable assistance to the Belgian Government by his advice 
on important social questions. He initiated the International 
Statistical Congress, which has served to bring together the leading 
statisticians of all countries, and the first meeting was held in 1853 
at Brussels. His death occurred at the ripe age of seventy-eight.

Some idea of the extent of Quetelet’s statistical researches may
be gathered from the titles of his chief works : (1) Sur l’homme et
le développement de ses facultés, ou essai de physique sociale (1835);
(2) Lettres . . . sur la théorie des probabilités appliquée aux sciences
morales et politiques (1846); (3) Du système social et des lois qui le
régissent (1848); (4) L’Anthropométrie, ou mesure des différentes
facultés de l’homme (1871).

In his writings he visualizes a man with qualities of average 
measurement, physical and mental (l’homme moyen), and shows
how all other men, in respect of any particular organ or character, 
can be ranged about the mean or average man, just as in Physics 
a number of observations of the same thing are ranged about 
the mean of all the observations. Hence he concluded that the
methods of Probability, which are so effective in discussing errors
of observation, could be used also in Statistics, and that deviations 
from the mean in both cases would be subject to the binomial law. 

Hain in Vienna put some of Quetelet’s ideas to good service in 
1852, employing a superior method for the calculation of statistical 
variability. Knapp and Lexis in Germany, also following up 
Quetelet’s principles, made an exhaustive investigation several years 
later of the statistics of mortality, and their work has been extended 
in many directions, and in our own time notably by Galton, Karl 
Pearson, and Edgeworth. 

The name of Sir Francis Galton (1822-1911), to whose work as 
a pioneer the science of Statistics owes so much, is deserving of 
even greater honour than it has yet received. Founder of the School 
of Eugenics, Galton himself came of famous stock, being grandson 
of Erasmus Darwin and a cousin to Charles Darwin. He studied 
medicine in early youth, but after graduating at Cambridge his 
attention was turned to exploration, and the Royal Geographical 
Society awarded him a gold medal on the results of his investigations
in South-West Africa. His first great work on heredity was
not published till 1869, after he had already earned distinction in
other directions, for he was elected a Fellow of the Royal Society
in 1860. Alive with new ideas, marvellously patient and persistent
in bringing them to the test of observation—qualities essential for 
real scientific research—he set himself to inquire into the laws 
governing the transmission of characteristics, physical and mental, 
from one generation to another. Large tracts of this ground have 
since been carefully explored and mapped out by the school of 
his great successor, Karl Pearson, who has originated formulae for 
testing the extensive anthropometrical and biological data collected.
Largely as a result of their work it is now widely recognized
that ‘ the whole problem of evolution,’ as Professor Pearson himself
has well said, ‘ is a problem in vital statistics—a problem of longevity, 
of fertility, of health, and of disease, and it is as impossible for the 
evolutionist to proceed without statistics as it would be for the 
Registrar-General to discuss the national mortality without an
enumeration of the population, a classification of deaths, and a 
knowledge of statistical theory.’ 

Logical Development. The best way to approach the study of 
any subject, if one had time, would be along the lines of its historical 
development, but these lines seem so often to diverge from the 
main theme, like branches from the parent stem of a tree, that
when one tries to describe them the general effect is apt to be somewhat
confusing. It is therefore usually the custom to adopt a
logical rather than a historical sequence, but it may assist the reader
to see the connection between the two and the unity which embraces
the whole if we now briefly trace the natural growth of the subject,
suggesting the steps we might expect it logically to take. This we
have tried to keep in view as nearly as possible in the succeeding 
chapters, except that the order may have been altered here and 
matter may have been omitted or inserted there as reason and 
the elementary nature of the work dictated :— 

1. Owing to the difficulty which the mind experiences in grasping 
a large mass of figures, the necessity for an average arises to sum 
up shortly the character of the mass, and various kinds of averages 
are proposed. 

2. An average proves insufficient alone to define the whole scheme 
of observations, and other constants are invented to measure their 
spread or dispersion about the average. 

3. Considerations of space and the desire for some kind of system 
lead further to the formation of tables with the observations classified
in ordered groups.

4. The formation of these tables suggests the possibility of a
graphical representation of the numbers in the different groups to 
bring out the nature of their distribution. 

5. The impossibility of dealing with a whole population results 
in the selection of samples, and the comparison of one sample with 
another introduces the subject of random errors. 

6. The closer examination of this subject leads us into the domain 
of mathematical probability and discovers the probability curve, or 
normal curve of error, first formulated in connection with the study 
of errors of observation. 

7. This same curve serves in the sequel to describe a certain
important type of statistical distribution, in which each observation
is determined by a multitude of so-called chance causes pulling this
way and that, so that it is impossible to foretell what the resultant
effect will be.

8. The failure of the normal curve to describe other common distributions,
especially those which are unsymmetrical in character,
leads to the development of skew varieties of curves which will
fit them.

9. The extent of connection between one set of data and a possibly
related set is a natural subject for inquiry giving rise to the
theory of correlation.



CHAPTER II


MEASUREMENT, VARIABLES, AND FREQUENCY DISTRIBUTION

Measurement. There are two fundamental characteristics which 
pertain to nearly all measurement: it is (1) relative : it involves 
a comparison between one magnitude and another of the same kind, 
and (2) approximate : the comparison in practice cannot be made 
with absolute exactness. 

A man’s height, for example, is stated to be 5 ft. 8¾ in., but this
would convey little to one who did not know how long a foot was
and how long an inch was. The first step in the measurement is
made by comparing the man’s length with a certain constant
length previously agreed upon as a standard or unit, namely, a
‘ foot ’; he is placed to stand up against a scale which is divided
up into feet, and the highest point of his head is seen to come
somewhere between the 5 ft. line and the 6 ft. line : he is therefore
longer than five of these units, set end to end, but not so long
as six of them. To carry the measurement a stage further a smaller
unit has to be introduced ; each foot length of the scale is subdivided
into twelve equal parts called inches, and the top of the
man’s head is found to come somewhere between the 5 ft. 8 in.
line and the 5 ft. 9 in. line: he is therefore over 5 ft. 8 in., but
not quite 5 ft. 9 in. in height. For the next stage in the measurement
each inch of the scale has to be further subdivided into quarter-inches,
and the top of the man’s head is found to come somewhere
between the 5 ft. 8 in. 3 qu. in. line and the 5 ft. 9 in. line ; moreover
it is nearer, let us suppose, to the former line than to the latter.
In this case, then, we say that the man’s height or length is 5 ft.
8¾ in., measured to the nearest quarter inch.

In measurement the decimal notation has very obvious advantages,
because each unit is always divided into ten equal parts to
get the next smaller unit. Thus a weight of 7 kilogr. 5 hectogr.
3 decagr. 8 gr. 4 decigr. 3 centigr. can be expressed at once in
grammes, namely 7538.43 gr. ; hence if we were measuring to the
nearest decagramme, the result would be expressed as 754 decagr. ;
to the nearest decigramme, it would be 75384 decigr., etc.
Similarly, a length of 12 kilom. 7 metres 2 centim. can be written
12007.02 metres, or, in kilometres, 12.00702 kilom., or, to the nearest
decametre, 1201 decam., and so on.
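Measuring "to the nearest unit" is simply a matter of rounding, and the figures just quoted can be checked mechanically. The following is a minimal sketch; the helper name to_nearest is our own, and the weight is read as 7538.43 grammes and the length as 12007.02 metres.

```python
from decimal import Decimal, ROUND_HALF_UP


def to_nearest(value, unit):
    """Number of whole `unit`s nearest to `value` (both given in the base unit)."""
    count = Decimal(value) / Decimal(unit)
    return int(count.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


weight_in_grammes = "7538.43"
print(to_nearest(weight_in_grammes, "10"))   # nearest decagramme
print(to_nearest(weight_in_grammes, "0.1"))  # nearest decigramme
print(to_nearest("12007.02", "10"))          # length in metres, nearest decametre
```

Decimal arithmetic is used here in preference to binary floating point so that the decimal subdivisions are represented exactly.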

The mere act of counting things of a like kind is, in a sense, 
measurement of a primitive type, one thing being the unit, though 
the measurement may in many such cases be exact; for example, 
we may count the number of persons in a room exactly. Even in 
this type of case, however, the counting or measuring cannot 
always be done accurately, but the inaccuracy arises from lack of 
precision and uniformity in definition rather than from want of 
power in the measuring instrument itself; e.g. in determining the
population of a city, inaccuracies may arise because of failure to 
define exactly the boundaries of the city, or the time at which the 
census is to be taken, or how to deal with the migration of the inhabitants
from or into the city, and with births and deaths during
the actual time of numbering. 


Variables. By a variable is meant any organ or character which 
is capable of variation or difference in size or kind. The difference 
may be measurable as in the case of head-length, height, temperature,
etc., or not directly measurable as in the case of colour, intelligence,
occupation, etc. Further, the variation, when measurable, may
be continuous, or it may take place only by integral steps, omitting
intermediate values: population, for example, can never go up or down
by less than one, but if temperature is to change from 60 degrees to 
61 degrees it must pass continuously through every intermediate 
state of temperature between 60 degrees and 61 degrees. 

In dealing with a measurable variable sometimes we are interested
not so much in its actual value at a particular instant as in
the change which has taken place in its value during some specified
interval, but to gauge fairly the amount of this change it is necessary
to measure it relative to the original value of the variable. For
example, if we are told that the wages of a certain person have
gone up during the year to the extent of 3d. an hour, we cannot
say whether this is much or little to him until we know what his
wages were originally. The addition would be relatively much less
if he were a skilled patternmaker earning 1s. 6d. an hour than it
would be if he were a chainmaker earning only 6d. an hour.* This
point can be met by stating, not simply the change in the value of
the variable, but the ratio of the new value to the old. For instance,
the patternmaker in the above instance has had his wages increased
in the ratio of 1s. 9d. to 1s. 6d. It is important to notice that
this form of measurement is quite independent of the particular
units used ; if we take 1d. as unit, the ratio = 21/18 = 7/6, and if
we take 1s. as unit, the ratio = 1¾/1½ = 7/6 just as before.

* [Wages to-day are much higher; the above figures are hypothetical.]
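That the ratio does not depend on the unit chosen can be verified with exact fractions. In this short sketch the variable names are our own, and the wages are taken as 1s. 9d. and 1s. 6d., with twelve pence to the shilling.

```python
from fractions import Fraction

# Wages after and before the rise, expressed in pence (1s. = 12d.).
new_pence, old_pence = 21, 18
ratio_in_pence = Fraction(new_pence, old_pence)

# The same wages expressed in shillings: 1s. 9d. = 7/4 s., 1s. 6d. = 3/2 s.
ratio_in_shillings = Fraction(7, 4) / Fraction(3, 2)

print(ratio_in_pence, ratio_in_shillings)
```

Both expressions reduce to the same fraction, because changing the unit multiplies the new and the old value by the same factor.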

There are other ways of measuring this change in the value of 
a variable. One of the commonest is to express it as a percentage 
of the original value; thus the patternmaker’s increase is at the 
rate of (3/18) × 100, or 16⅔ per cent., which is simply the ratio of
increase in wage to previous wage multiplied by 100. The multiplier, 
100, is quite an arbitrary factor, but it has obvious advantages: among 
others, it works well with the decimal notation and it often serves 
to put the result into a form which is greater than unity instead of 
leaving it as a fraction. Again, a man who gets a dividend of £25 
on an investment of £500 receives interest at the rate of (25/500) × 100,
or 5 per cent.; in other words, this is the rate at which his capital 
accumulates if the interest is added to it instead of being spent. 
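The two percentages may be worked as follows. This is only a sketch: the function name percentage_increase is ours, and the patternmaker's wage is taken as 18d. with a rise of 3d.

```python
from fractions import Fraction


def percentage_increase(increase, original):
    """Increase expressed as a percentage of the original value."""
    return Fraction(increase, original) * 100


wage_rise = percentage_increase(3, 18)        # 3d. on a wage of 1s. 6d.
dividend_rate = percentage_increase(25, 500)  # £25 on £500 invested

print(float(wage_rise), float(dividend_rate))
```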

Annual birth rates and death rates, on the other hand, are best
expressed per thousand of the population, as estimated, say, at
the middle of the year in question ; e.g. the birth rate of the United
Kingdom in 1911 was 24.4 per thousand, and the death rate was
14.8 per thousand, which is equivalent to 244 and 148 per 10,000
of the population respectively. If we could assume the birth 
and death rates to remain constant from year to year, and if we 
could afford to leave migration out of account, the population 
would be subject to exactly the same law of increase as capital 
accumulating at compound interest [see Part II, p. 263], thus:— 

1. If P be the original population, and if the annual net increase
be at the rate of 25 per thousand, then

the population in 1 year’s time  = P × (1.025)
the population in 2 years’ time = P × (1.025)²
the population in 3 years’ time = P × (1.025)³
the population in n years’ time = P × (1.025)ⁿ.

2. If £P be the original capital, and if the annual increase be at
the rate of 2½ per cent., then

the capital in 1 year’s time  = P × (1.025)
the capital in 2 years’ time = P × (1.025)²
the capital in 3 years’ time = P × (1.025)³
the capital in n years’ time = P × (1.025)ⁿ.
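The law of growth common to the two cases can be sketched in a few lines. The function names and the starting value of 1000 are our own choices for illustration; the rate 0.025 stands equally for 25 per thousand and for 2½ per cent.

```python
def grow(initial, rate, years):
    """Value after `years` of growth at a constant annual `rate` (closed form)."""
    return initial * (1 + rate) ** years


def grow_step_by_step(initial, rate, years):
    """The same quantity, adding each year's increase to the running total."""
    value = initial
    for _ in range(years):
        value += value * rate
    return value


for n in (1, 2, 3, 10):
    print(n, round(grow(1000, 0.025, n), 2))
```

The second function adds the year's increase to the stock each year, which is the sense in which population under constant rates behaves like capital at compound interest.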

Lest we may seem to have laboured to make plain what is really 
a simple idea, it may be remarked that quite frequently confusion 
arises with regard to percentage even in reputable quarters. As an
illustration of the kind of mistake which, without thinking, is easily
made, the following argument has been taken from a monthly
circular sent out a little while ago to the members of the Boilermakers’
Society by their Secretary: Since July 1914, wages have
risen 15 per cent., the cost of living has gone up 45 per cent., therefore
the workers' real wages have fallen 30 per cent. This same argument
was quoted shortly after in one of the leading articles of The Manchester
Guardian under the heading ‘ Prices and Wages,’ and again
in The Labour Leader tersely as truth ‘ In a Nutshell,’ but in 
neither instance did it seem to have occurred to the writer that it 
was inaccurate. It may be worth while for the sake of clearness to 
show what the statement should have been :— 



                Wages.   Cost of    Ratio of Wages to   Same Ratio
                         Living.    Cost of Living.     multiplied by 100.

July 1914        100      100        1                   100
October 1916     115      145        115/145              79


Since (115/145) × 100 is roughly 79, this calculation shows that ‘ real
wages ’ had fallen only about 21 per cent. (100 - 79 = 21), and not
30 per cent., as stated, between the two dates.
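The corrected reckoning, set beside the faulty one, may be sketched thus. The function name real_wage_index is our own; the index figures 115 and 145 are those of the circular, with July 1914 taken as 100.

```python
def real_wage_index(wage_index, cost_index):
    """Wages relative to the cost of living, with the base date equal to 100."""
    return wage_index / cost_index * 100


real = real_wage_index(115, 145)
fall = 100 - real      # fall in real wages, per cent.
naive_fall = 45 - 15   # the circular's subtraction of one percentage from another

print(round(real), round(fall), naive_fall)
```

The error lies in subtracting two percentages that ought to be divided: a change in real wages is a ratio of the two indices, not a difference between them.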

Index Numbers. A very important case of variables changing 
with time appears in the discussion of changes in the value of 
money as measured by the movement of prices of commodities, 
introducing the notion of an index number. For example, supposing 
the wholesale price of beef was 6d. a lb. at one date, 8d. a lb. at 
another date, and 5½d. a lb. at a third date, the change might be
exhibited as in the following table :



                  1st Date.   2nd Date.   3rd Date.

Price of beef       6d.         8d.        5½d.
Index number        100         133         92


Here 100, 133, and 92 are called index numbers, the price at the 
first date being taken as a standard and denoted by 100, while 
the prices at the other two dates are altered proportionally, so that 


6 : 8 : 5½ = 100 : 133 : 92.
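The proportion can be computed directly. In this sketch the function name index_numbers and its base argument are our own; the prices are those of the beef example, in pence per lb., with the first date as standard.

```python
def index_numbers(prices, base=0):
    """Express each price as a rounded percentage of the price at position `base`."""
    return [round(price / prices[base] * 100) for price in prices]


beef_pence_per_lb = [6, 8, 5.5]
print(index_numbers(beef_pence_per_lb))
```

Choosing a different base date alters every index number but leaves their ratios to one another unchanged, apart from rounding.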


Index numbers calculated on this principle have been published
systematically for several years by Mr. A. Sauerbeck (in the Journal
of the Royal Statistical Society up to January 1913, and continued
afterwards in The Statist under the supervision of Sir George Paish)
and in The Economist.

In Sauerbeck’s index numbers the average wholesale prices of 
forty-five commodities for the eleven years 1867-77 are taken as
the standard, being denoted each by 100 as above, and the prices
of the same commodities for any other year are then written as
percentages of these standard prices. The commodities chosen are
various: food of all kinds (cereals, meat, potatoes, rice, butter,
sugar, coffee, tea), minerals (including coal), textiles, and sundries
(including hides, leather, tallow, palm oil, olive oil, linseed,
petroleum, soda, soda nitrate, indigo, timber). Articles of similar
character are grouped together ; naturally no class is exhaustive,
but the selection is a fairly representative one. A sort of general
average is then formed by combining all the results, and the movement
of this average is taken to measure changes in the value of
money. An example will make clear the way in which an index 
number for each group and the general average are obtained. 

The index number for each separate commodity may be first 
calculated thus:— 


PRICE OF ENGLISH WHEAT.

Years.       Price per Quarter.    Index Number.
                 s.  d.
1867-77          54  6                 100
1912             34  9                  64


Now forming similar index numbers for each of the eight vegetable 
and cereal foods and combining them together, we have :— 


INDEX NUMBERS FOR VEGETABLE AND CEREAL FOODS.

[The eight separate columns, one for each commodity (English wheat,
American wheat, and six others), are not legible in this copy; the
1912 entries that survive are 68, 70, 79 and 83.]

Years.       Eight Commodities (total).    Average.

1867-77              800                     100
1912                 624                      78


The figures in the last column but one are obtained by simply 
adding the figures in the eight previous columns, and, dividing these
results by eight, we get the average index number for the group
in 1912 as a percentage of that in the standard years 1867-77. 

Treating all the other commodities in the same way we ultimately 
get index numbers for all the different groups and for all com¬ 
modities combined as follows :— 


INDEX NUMBERS FOR DIFFERENT GROUPS AND
FOR ALL COMMODITIES.

Group.                         No. of Commodities.   1867-77.    1912.

Vegetable and Cereal Foods             8               100         78
Animal Food                            7               100         96
Sugar, Coffee, Tea                     4               100         62
All Food                              19               100         81
Minerals                               7               100     (illegible)
Textiles                               8               100         76
Sundries                              11               100         82
All Commodities                       45               100         85


The index number for ‘ All Food ’ is obtained by summing the 
nineteen index numbers for the separate commodities which are 
included in this class and dividing the result by 19. Similarly the 
general index number for all commodities is obtained, not by 
adding the numbers for the different groups and dividing by the 
number of groups, but by adding the forty-five index numbers of 
all the separate commodities and dividing the result by 45. 
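The distinction drawn here, between averaging the commodities and averaging the groups, can be checked on the 1912 food figures of the table: group index numbers of 78, 96 and 62 for groups of 8, 7 and 4 commodities. The sketch below is ours; the label given to the second group is an assumption, and since the group figures are themselves rounded the pooled result is only approximate.

```python
# (number of commodities, 1912 group index number), as read from the table.
food_groups = {
    "vegetable and cereal foods": (8, 78),
    "meat and butter (label assumed)": (7, 96),
    "sugar, coffee, tea": (4, 62),
}

total_items = sum(count for count, _ in food_groups.values())

# Each group index is the mean of its commodities, so count * index
# recovers (approximately) the sum of that group's commodity index numbers.
pooled = sum(count * index for count, index in food_groups.values()) / total_items

# The method the text warns against: averaging the group figures directly.
mean_of_groups = sum(index for _, index in food_groups.values()) / len(food_groups)

print(total_items, round(pooled), round(mean_of_groups))
```

Pooling the commodities gives every commodity equal weight, so that a group counts in proportion to the number of articles it contains, whereas averaging the group figures would give the four articles of the smallest group as much influence as the eight of the largest.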

In The Economist the average prices of fifty-eight commodities
for a selected year are taken as the standard, being denoted each
by 100, and the prices of the same commodities for any other year
are written as percentages of the standard prices; the unweighted
geometric mean of these percentages is taken as the index
number, and it is simple to calculate by use of logarithms. The
following table provides material to illustrate the method of
calculation:


PRICE INDEX, JUNE 2ND, 1937, AS PERCENTAGE OF
MEAN PRICE LEVEL IN 1927.

                Cereals                                        Miscel-
                and Meat.   Other Foods.   Textiles.   Minerals.   laneous.   Total.

No. of Items      13            9            11          11          14        58
Price Index       93.6         68.6          73.2       110.0        86.7      86.2

In the table the importance of each group is determined by the
number of items included. Also, the arithmetic mean of the logs 
of the price indices for the separate items in any group is the log 
of the price index for that group. 
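The calculation by logarithms may be sketched as follows. The function name weighted_geometric_mean is our own, and the Minerals entry is read as 110.0 (an assumption about a figure that is poorly printed); the other figures are those of the table.

```python
import math

# (group, number of items, price index as a percentage of the 1927 level)
groups = [
    ("Cereals and Meat", 13, 93.6),
    ("Other Foods", 9, 68.6),
    ("Textiles", 11, 73.2),
    ("Minerals", 11, 110.0),   # assumed reading of the printed figure
    ("Miscellaneous", 14, 86.7),
]


def weighted_geometric_mean(rows):
    """Geometric mean of the group indices, each weighted by its number of items."""
    total_items = sum(count for _, count, _ in rows)
    mean_log = sum(count * math.log10(index) for _, count, index in rows) / total_items
    return 10 ** mean_log


overall = weighted_geometric_mean(groups)
print(round(overall, 1))
```

Weighting each group's logarithm by its number of items is equivalent to taking the unweighted mean of the logarithms of all fifty-eight separate items, since the log of a group index is itself the mean of the logs of its items.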

It is clear that what is at bottom the same principle may be 
applied in any case of a variable changing with time when we wish 
to measure the extent of the change, so that the use of index numbers 
is not confined to the problem of prices. We shall return again to 
discuss one or two further points in connection with the same 
subject in the Chapter on ‘ Averages.’ 

Frequency Distribution. So far we have been thinking more
particularly of the change which an individual variable, or a collection
of such variables, may undergo in the course of time, or the
difference between two values which the same variable may have 
at two different instants of time, and how to measure it. Now 
the science of Statistics is based upon the study of the crowd 
rather than of the individual, although observations on individuals 
have to be made before they can be combined together to produce 
the crowd, just as individual income-tax schedules have to be 
completed and combined before the balance-sheet of the State can 
be drawn up. As we pass from one individual to another there 
may be great differences in the organ or character observed, hence
the word variable already introduced, but in the mass these
differences are merged together and lose their individual importance:
it is rather their resultant effect we seek to measure. In order 
therefore to discover this effect it is necessary to make a collection 
of individual observations and to analyse the results. Now if our 
ultimate conclusions are to be safe the number of observations 
must be considerable, and in order to be able to cope with them 
and reduce them to some sort of system the first step in the analysis 
consists in arranging them in different classes according to the 
value of the variable under consideration. 

It is to be noted that now we are dealing with changes in the 
value of a variable as we pass from one individual to another at the 
same period of time and under the same general conditions, and not
with the change in a variable in the same individual occurring with
the lapse of time. We wish, for example, to draw a distinction 
between (1) the change in wages as we pass from one man to another 
at the same time in the same trade, and (2) the change in wages of 
the same man, or class of men, in the same trade occurring in a 
given period of time ; in the first case we want to find the amount 
of diversity within the trade at some stated time, and in the second
our object is to discover whether an improvement has taken place 
in the wages of a particular individual or a particular trade with 
the passage of time. 

In picturing variation of the first type the conception arises of a 
frequency distribution where the observations are distributed in 
ordered groups, with a number corresponding to each showing 
how many, or how frequent, are the individuals possessing the type 
of variable or character which defines that group. More generally, 
if a series of measurements or observations of a variable y are 
made corresponding to a selected series of another variable x we 
get a distribution, which becomes a frequency distribution when y 
represents the frequency of events happening in a particular way, 
or of individuals corresponding to a particular value of some 
common variable or character, represented by x. Thus (1) the 
boys in a school might be grouped according to their intelligence: 
so many, dull; so many, of ordinary intelligence; and so many,
bright or above the ordinary. Again (2) in an inquiry into the 
housing of the people in any town or district it would be necessary
to draw up a table showing the number or frequency of existing 
tenements with one room, the frequency of tenements with two 
rooms, the frequency of tenements with three rooms, and so on. 
Once more (3) a zoologist, wishing to discover whether crabs of a 
certain species caught in one locality differ in any remarkable way 
from members of the same species caught in another locality, might 
start by making measurements of the length of carapace or upper
shell for crabs of like sex in the two places and then proceed to 
form frequency tables for each, setting out the frequency of crabs 
for which the carapace length lies, say, between 5 and 6 millimetres, 
the frequency with length between 6 and 7 millimetres, the frequency 
with length between 7 and 8 millimetres, and so on. He would 
then have in these tables some basis for comparing the specimens 
caught in the two localities. 
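
The forming of such a frequency table may be sketched as follows. The carapace lengths below are invented purely for illustration, and the helper `frequency_table` is a hypothetical name; the sketch assumes that every observation is not less than the lowest class limit.

```python
from collections import Counter

def frequency_table(values, lower, width):
    """Count observations in the classes 'lower and under lower + width', and so on."""
    counts = Counter(int((v - lower) // width) for v in values)
    return [(lower + k * width, lower + (k + 1) * width, counts[k])
            for k in range(max(counts) + 1)]

# Hypothetical carapace lengths in millimetres (not taken from any real survey).
lengths = [5.2, 5.9, 6.1, 6.4, 6.8, 7.0, 7.3, 7.7, 6.6, 5.5, 8.1, 6.9]

table = frequency_table(lengths, lower=5, width=1)
for low, high, frequency in table:
    print(f"{low} and under {high} mm: {frequency}")
```

A second table built in the same way from the other locality would give the basis for comparison described in the text.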

The three illustrations just used give three different types of
distribution corresponding to the three types of variable to which
attention has been drawn before. In the first, where the variable
or character observed is not measurable, doubt will sometimes
arise as to the appropriate class in which individuals should be
placed who seem to be on the border line between dulness and
mediocrity or between mediocrity and brilliance, so that accurate
classification will greatly depend upon what is called the ‘personal
equation’ of the observer. The second illustration corresponds
to the case where the variable changes not continuously but by
unit stages; the choice of classes in such a case depends little
upon the observer unless the unit is very small compared to the
total range of variability; for example, a tenement might either
definitely have two rooms or it might have three rooms, but it
clearly could not be put down as having 2½ rooms or 2¾ rooms;
in other words, the only natural classification is so many tenements
with two rooms, so many with three rooms, so many with four
rooms, and so on, though here too some confusion might arise
through failure to define clearly what is ‘a room.’ In the third
type, where we can conceive of the continuous variation of the
character under observation, there would be nothing surprising in
the appearance of any value of the variable between the lowest
and highest values observed; the choice of suitable limits for the
several groups becomes therefore in this case rather a delicate
matter which requires careful judgment.

We shall begin the next chapter with some general remarks
upon the subject of classification and tabulation.



CHAPTER III 

CLASSIFICATION AND TABULATION 

No part of Statistics is of more importance than that which deals 
with classification and tabulation, and it is the one part for which 
no very precise rules can be given. A neat arrangement of ideas 
in the mind, capacity to express them clearly, and patience are 
indispensable, but experience alone will convince one of the extreme
care which must be exercised if blunders are to be avoided and 
time is to be saved in the long run. This has to be emphasized
because most people, until they have tried and failed, imagine 
that to arrange things in classes and in tables is a straightforward 
proceeding involving no great thought or trouble. 

Abundant matter of a statistical character is published periodically
in Blue-books, Government Reports, Reports of Local Authorities,
Directors of Education, Medical Officers of Health, Chief
Constables, Employers’ Associations, Trade Unions, Co-operative
Societies, etc., but it needs a trained intelligence as a rule to
assimilate it and turn it to further advantage. The larger the scale upon
which any inquiry is made, the more valuable should the results 
be, granted that equal accuracy is possible on the large as on the 
small scale, but it is fairly clear that mistakes of various kinds 
have also much more chance of creeping into a large work than into 
a small one. To appreciate the various and numerous possibilities 
of error when the scope is wide it is enough to read the introductions
to the Registrar-General’s Reports on the Census from decade
to decade ; this should also impress the student with the care that 
is necessary if he proposes to use such material for the investigation 
of some other problem. It may seem a comparatively simple task 
to abstract two sets of figures from a Census Report, to establish 
a one-to-one correspondence between them, and to make deductions 
therefrom, but such figures when taken from their context will 
sometimes lead to absolutely unsafe, if not false, conclusions. The 
exact meaning and limitations of any data can only be properly 
appreciated by one who has been closely in touch with the persons 
who have collected them, and it is therefore important, before 
attempting to re-classify or re-tabulate any old statistics for a new
purpose, to read very carefully through the notes made by the 
original compilers. 

Perhaps the best advice that can be given to any one in this 
connection is that he should embark upon some small inquiry 
which will necessitate the collection of statistics for himself ; the 
final result of his efforts may seem disappointing, but the experience
he will gain will be invaluable. Ideas for such an inquiry will
occur to him if he reads through some authoritative work on social
questions, e.g. Beveridge’s Unemployment, the decennial Census
Reports, or The Minority Report on the Poor Law (1905). But he
must read with an open and critical mind, questioning particularly 
the foundation for all statements as to cause and effect which may 
be made. A few simple hints may be useful as to method of 
procedure. 

When he thinks he has discovered some subject of interest which 
would appear to deserve examination, it will be well to put it 
down on paper in order to get it clearly defined, because a precise 
written statement is likely to carry one further than a shadowy 
idea somewhere at the back of the mind which is hardly formulated
at all. When the actual collection of statistics is begun
it will almost certainly be found that it is impossible to solve the
original problem contemplated; but that need not prevent further
progress—what is important is that the limitations should be 
exactly realized, and this will be impossible unless the original 
problem is clearly presented side by side with the nearest solution 
obtainable. 

The problem stated, the next thing is to set down categorically 
a number of questions, the answers to which are to be the raw 
material for the solution of the given problem. For the answers 
let us assume the inquirer is dependent upon the goodwill of others, 
either employers, or trade union secretaries, or public officials. 
The questions in that case must be clearly, concisely, and courteously 
phrased, and must not be capable of more than one interpretation. 
In number they should be few and in character not inquisitorial; 
moreover, the replies should be obtainable without any great labour 
on the part of the persons approached. Here again it will be found 
that the questions first set down are not all satisfactory ; one will 
be too vague; another, though clear enough, may involve a
considerable search through a mass of other matter before it can be
properly answered ; while to another it might be impossible to give 
an exact reply in any case. Revision and amendment may therefore
be necessary in the light of the first replies received, and the
inquirer will begin to see at this stage how far the solution to his
original problem is really possible.

When the bulk of the returns have come in they should be critically 
examined one by one. A number will, for one reason or another, 
be worthless, and they must be discarded ; as for the remainder, 
if the questions were well chosen, the answers should not be difficult
to interpret and classify; the most successful questions are those
to which a simple ‘yes’ or ‘no’ in reply gives all the information
required; numerical answers are less easy to deal with, especially
if there is the least chance of misunderstanding on either side as
there often is, for example, in the case of observations which are
on the border line between two classes.

Tables should then be drawn up and the headings to the different 
columns of the tables should state concisely and exactly what the 
figures below represent. So far as possible any one should be able 
readily to grasp their general meaning without being obliged to 
wade through a page or two of written explanation ; if any heading 
cannot be clearly expressed in a few words it may be helped out 
by a further note at the bottom of the page, but too many such 
notes are to be avoided. 

Finally, a summary should be made of the various conclusions 
suggested by a study of the tables. Some of the points raised in 
the course of the inquiry will perhaps be only incidental to the 
main problem under discussion, but may still deserve a passing 
reference. It will also be of advantage to follow up the summary 
by any recommendations which can be fairly based on the
conclusions obtained, when the problem is such that recommendations
are expedient, and, if ultimately the whole is of sufficient value to 
be printed, emphasis can be introduced where necessary by suitable 
variations in type. 

For this part of the work considerable judgment is necessary
which can only be acquired by long training: a faculty to pick out
the real from the false and an eye to distinguish the important from
the trivial. A sense of numerical proportion too is desirable
incidentally; one of our leading exponents on finance in a book dealing
with the meaning of money uses a very interesting illustration which
is perhaps worth quoting here to show how even an acute mind
may on occasion prove itself curiously lacking in such a sense.
He is seeking to show how the credit system of the country is built
upon a foundation composed of a little gold and a lot of paper;
for this purpose he amalgamates together the balance-sheets of half
a dozen big banks, and proves that their liabilities on current and
deposit account amounted at a certain date prior to 1914 to 249
million pounds, while the cash in hand and at the Bank of England
was 43 millions. Of the 43 millions he estimates that roughly
20 millions would be cash in the Bank of England, and further
that about two-thirds of this 20 millions would be represented really
by securities and not by gold. Hence he concludes that to support
this vast erection of credit there would only be £6,666,666 of actual
gold. Thus after talking throughout in millions the author closes
by giving his result true apparently to a pound!

Much may be learnt as to methods of classification and the
drawing up of tables by a careful study of those which appear in 
various official reports, and a few such tables are reproduced in 
the pages which follow. 


Table (1). Condition as to Cleanliness of
School Children in Surrey.

Cleanliness.                5 years, 1908-12.
                            7U,070 children inspected.

Above the average .         15.4 per cent.
Average                     76.5 per cent.
Below average                7.6 per cent.
Much below average           0.5 per cent.


Table (2). Condition as to Infectious Diseases of
School Children at Different Ages in Surrey (1913).

Age Groups Inspected             5-6         8-9        13-14      Total at
                                                                   All Ages.

Numbers inspected               5,191       5,151       4,962      15,304

Proportion who before
inspection had suffered from:  per cent.   per cent.   per cent.   per cent.
Diphtheria .                     1.3         3.5         5.4         3.4
Scarlet fever .                  2.7         7.2        10.9         6.9
Measles .                       55.3        79.3        84.6        72.9
Whooping cough .                41.8        56.4        64.3        60.9
German measles .                 2.9         6.1         7.6         5.1
Chicken pox .                   26.1        40.1        38.6        34.9
Mumps .                         10.6        22.0        20.8        20.7
No infectious diseases .        18.9         6.1         4.7        10.0
No definite information          3.3         2.2         0.9         2.2


Table (3). Height of School Children according to
District, Age, and Sex (1913).

                        Boys.                                      Girls.
Age        Nos.       Average     Average Height       Nos.       Average     Average Height
Groups.    measured.  Height      in cms.              measured.  Height      in cms.
                      in inches.  Surrey.  England                in inches.  Surrey.  England
                                           and Wales.                                  and Wales.

5-6         2724       41.4       105.2    103.4        2467       41.3       104.9    102.6
8-9         2578       47.8       121.4    120.4        2573       47.5       120.7    119.4
13-14       2529       57.0       144.8    142.4        2433       57.9       147.1    144.2


The first four are taken from the Annual Report of the School
Medical Officer for the County of Surrey, 1913. The first is an
example of single tabulation showing the distribution according to
cleanliness of children inspected in the elementary schools. The
second is an example of double tabulation, showing the distribution
according to age of school children who at some period before
the date of inspection had suffered from certain infectious diseases.
The third is an example of quadruple tabulation, showing the
distribution of school children according to height, district, sex, and
age. Thus in the first case we have one factor brought into relief,
viz. cleanliness; in the second case we have two factors, age and
disease; in the third case we have four factors, height, district,
sex, and age.

When we have two or more factors tabulated together as in cases 
(2) and (3), we may be sometimes led to discover a connection of 
some kind, possibly causal, between them, and the search for such 
a connection, or correlation as it is called, represents one very useful 
purpose to which tabulation may be put. Table (4) is an illustration
of this. It is the result of certain measurements carried out in
order to discover the effect of employment out of school hours upon
the physical condition of boys. The particular factor examined as
the possible cause of evil in this connection is lack of sleep, and
the figures given certainly seem to warrant a closer examination
into the matter.

Table (4). Physical Condition of certain Boys according
to Hours of Sleep Obtained.

                                                              Nutrition.
No. of Hours      No. of Boys   Average     Average     Percentage   Percentage   Percentage
Sleep obtained.   examined.     Height in   Weight in   above        average.     below
                                inches.     lbs.        average.                  average.

7 to 8 .              14          54.5        71.3         7.1         35.8         57.1
8 to 9 .              80          55.4        73.9        10.1         65.9         24.0
9 to 10 .            296          56.4        79.3        15.3         64.5
10 to 11 .           280          57.9        83.2        22.8         66.5
11 to 12 .            60          59.0        87.0        22.0         68.0         10.0


Tables (5) and (6) are two illustrations of neat tables, containing 
a large amount of information in a small space, set out in such a 
form that the eye can easily take it in—and that is the main purpose 
of tabulation. These examples are selected from the Sixteenth 
Abstract of Labour Statistics of the United Kingdom, Cd. 7131. 

In Table (6) note the classification of age groups: it is not ‘5 to
10 years,’ ‘10 to 15 years,’ and so on, but ‘5 and under 10 years,’
‘10 and under 15 years,’ and so on. This removes difficulties at
the border lines between two classes; the difficulties are not
completely removed, however, unless there is some understanding as
to what shall constitute under any particular age. Shall it be six
months under, or one day under, or one hour under? This sort
of ambiguity has more importance in some cases than in others.
Suppose, for example, we were classifying men according to their
height: a group of the type ‘60 inches and under 62 inches,’
assuming that measurements were made to the nearest half-inch,
would really include all men who were ‘59¾ inches and under
61¾ inches’; because one who measured anything from 59¾ in.
to 60¼ in., being nearer to 60 in. than to 59½ in. measuring to
the nearest half-inch, would be registered as 60 in. in height, while
one who measured anything from 61¾ in. to 62¼ in., being nearer
to 62 in. than to 61½ in., would be registered as 62 in. in height.
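
The effect of the unit of measurement upon the real limits of a class may be sketched thus. The function `true_limits` is a hypothetical helper, and it assumes that the stated limits of the class are themselves exact multiples of the unit to which measurements are made.

```python
from fractions import Fraction

def true_limits(lower, upper, unit):
    """Real limits of the class 'lower and under upper' when every
    measurement is recorded to the nearest multiple of `unit`."""
    half = unit / 2
    # The smallest recorded value in the class is `lower`;
    # the largest recorded value is `upper - unit`.
    return lower - half, (upper - unit) + half

# Heights recorded to the nearest half-inch, class '60 inches and under 62 inches'.
low, high = true_limits(Fraction(60), Fraction(62), Fraction(1, 2))
print(float(low), float(high))  # 59.75 61.75
```

Exact fractions are used rather than decimals so that quarter-inches are carried without rounding.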

Table (5). Classification of Overcrowded Tenements*
England and Wales (1911).

                     Urban Districts.                  Rural Districts.                      Total.
Tenements      No. of      Occupants thereof.    No. of      Occupants thereof.    No. of      Occupants thereof.
with           Over-                  Per cent.  Over-                  Per cent.  Over-                  Per cent.
               crowded     No.        of total   crowded     No.        of total   crowded     No.        of total
               Tenements.             popu-      Tenements.             popu-      Tenements.             popu-
                                      lation.                           lation.                           lation.

1 room .         56,290    205,022      0.7        1,545       6,748      0.1       57,835      211,770     0.6
2 rooms .       119,695    712,613      2.5       15,397      91,458      1.2      135,092      804,071     2.2
3 rooms .       107,892    847,937      3.0       22,380     175,988      2.2      130,272    1,023,925     2.8
4 rooms .        64,470    624,747      2.2       17,341     167,969      2.1       81,811      792,716     2.2
5 or more
  rooms .        21,200                 0.9        4,700      55,585                25,900      306,990

* For the purposes of the … Report, ordinary tenements which have more
than two occupants … included, are considered overcrowded.


Table (6). Population grouped according to Age
England and Wales (1911).

MALES.

                           Urban Districts.         Rural Districts.         All Districts.
Age Group.                 Number.     Percentage.  Number.     Percentage.  Number.      Percentage.

Under 5 years              1,517,432      11.3        418,681                1,936,113
 5 and under 10 years      1,431,900      10.6        415,395                1,847,295
10 and under 15 years      1,341,586       9.9        406,045                1,747,631
15 and under 20 years      1,267,500       9.4        387,395                1,654,895
20 and under 30 years      2,332,135      17.3        626,300                2,958,435
30 and under 40 years      2,094,934      15.5        542,370                2,637,304
40 and under 50 years      1,556,818      11.6        444,360                2,001,178
50 and under 60 years      1,042,868       7.7        333,368                1,376,236
60 and under 70 years        612,741       4.5        230,306                  843,047
70 and upwards               296,246       2.2        147,228                  443,474

Total                     13,494,160     100.0      3,951,448     100.0     17,445,608     100.0

Grouped percentages, urban districts: under 20 years, 41.2; 20 and
under 50 years, 44.4; 50 years and upwards, 14.4. All districts,
50 years and upwards, 15.2.

Another point to be noted is that in general people making
returns seem to have a psychological weakness for round figures,
so that a man in the neighbourhood of 40 years of age, for example,
is apt to record himself as actually 40 although he may really
be 39 or 41 years old. To diminish the error arising from this fact
it is usual, when not otherwise inconvenient, to fix the centres
of the class-intervals at round figures: e.g. to take ‘15 and under
25 years,’ ‘25 and under 35 years,’ etc., in preference to ‘20 and
under 30 years,’ ‘30 and under 40 years,’ etc. Where there is
any known bias in the data, as, for instance, in the familiar case
of certain women who consistently register themselves as younger
than they really are, a correction can be made in the final figures.

In any frequency distribution where we wish to group a number 
of observations according to the magnitude of some common 
variable, as in Table (6) a number of males grouped according to 
age, the question arises—‘ How many groups should there be ? ’ 
With this question is involved also the size of the corresponding 
class-interval, and this should be so large that, with possible
exceptions at either extremity of the table, there are a fair proportion of
observations to each class or group ; and, contrariwise, it should 
be so small that all the observations in any one group may be 
treated practically as if they were located at the centre of the group 
so far as the variable in question is concerned, e.g. it should be 
possible to treat males recorded in class ‘ 50 and under 60 years,’ 
where the interval is 10 years, as if they were all of age 55 years. It 
will be found in general that a number of groups somewhere in the 
neighbourhood of 20 is the most satisfactory, granted that the 
number of observations is reasonably large, although in some cases 
it is impossible to split up the unit of class-interval, and we are 
obliged to be satisfied with a smaller number of groups on this 
account: Table (5) is a case in point where we are tied down to 
one room as the class-interval. In Table (6) the class-interval 
varies, being only 5 years at first, and afterwards 10 years, but
as a rule the labour of calculation of the different statistical constants 
we require is considerably simplified if it is possible to keep the 
size of the class-interval the same for each group. 


CHAPTER IV 


AVERAGES 

Common Average or Arithmetic Mean. Let us consider one of the 
commonest meanings of the term average. If a train travels a 
distance of 180 miles in 3 hours we say that it has been moving 
at 60 miles an hour. By this we do not mean that its speed is
always 60 m/h, never more, never less, but that if it had moved 
always at that uniform speed it would have accomplished its 
journey in exactly the same time. As a matter of fact, during 
some instants it may have been moving at a much slower rate 
than 60 m/h, but, if so, it must have made up for this slackness 
by travelling at a much faster rate than 60 m/h during other 
instants, so that on the whole a balance was effected, and, as we 
say, the speed averaged out at 60 m/h. 

Again, suppose the wages of three men are : A, 27s. a week; 
B, 18s. a week ; C, 30s. a week. We should say that the average 
wage of the three was equivalent to

⅓(27+18+30)s. = 25s. a week.

In other words, if A, B, and C were all under the same employer,
and if, instead of paying them different amounts, he wanted to
pay them all equally, he would have to give each man 25s. a week,
assuming that his total wages bill was to remain unaltered. This
method of measurement gives what is known as the arithmetic 
mean, or, more simply, the mean. 
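
The same reckoning may be sketched in Python, using only the figures of the text; the variable names are illustrative. The check on the total wages bill expresses the property just described, that equal payment at the mean leaves the total unaltered.

```python
# Weekly wages in shillings, as given in the text.
wages = {"A": 27, "B": 18, "C": 30}

mean_wage = sum(wages.values()) / len(wages)
print(mean_wage)  # 25.0

# Paying every man the mean wage leaves the total wages bill unaltered.
assert mean_wage * len(wages) == sum(wages.values())

# The train: the mean speed is the whole distance divided by the whole time.
mean_speed = 180 / 3  # miles an hour
print(mean_speed)  # 60.0
```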

Once more, in discussing the state of the labour market as regards 
different trades, when we wish to compare one with another, it is 
not the actual numbers unemployed in each trade that are quoted, 
but these numbers expressed as percentages of the total numbers 
employable in each trade. 

In each of these three cases we reduce our observations or 
measurements to a sort of common denominator, so that they may be 
mentally compared or contrasted more readily with other
observations of a similar character. Thus we have in mind a certain mean
train speed per hour, or mean wage per week, or mean percentage
out of work, as the case may be. 

An average then in general we may regard as one of a class
of statistical constants (others of which we are to meet later) which
concisely label a set of observations or measurements pertaining
to a common family. It is designed to describe the family type
more nearly than is possible by observing any chance member, and in
value it should therefore come somewhere near the middle of the
family group, so that if the individual members of the family
chance to be equal each to each in respect to the organ or character
observed it should have the same value as they have. This
constitutes a test for the validity of any formula giving the average of a
set of observations: e.g. we might, if we wish, define the average
of three numbers, p, q, r to be, not ⅓(p+q+r) but

for (1) this formula, too, can be shown to give a number intermediate
in value between the greatest and least of the numbers p, q, r;
also (2) if we put p=q=r=k (say), the formula reduces to k.

Clearly the range of choice for the definition of an average is 
infinite, though only a few definitions give averages which have 
proved their utility and come into general use. Of these the most 
important is the common mean already introduced, with its
extension, the weighted mean, but at least two others deserve special
consideration, the median and the mode. 

Median. In any observed distribution if all the individuals 
can be arranged in order of magnitude of the character or organ 
observed, which may be conveniently done when they are not very 
numerous, the median organ or character will be that pertaining to 
the individual half-way along the series, so that there are in general
an equal number of individuals above and below the median. 
For instance, if seven boys of different heights be placed to stand in 
a row, the tallest first, the next tallest next, and so on, the median 
height is the height of the fourth boy from either end. If there 
are an even number of boys, say eight, it would be natural to take 
as median the height midway between that of the fourth and that 
of the fifth boy. 
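
The rule for an odd and for an even number of individuals may be sketched as below. The heights are invented for illustration, and the function is written out in full only to show the rule.

```python
def median(values):
    """Middle value of the ordered series; midway between the two
    middle values when the number of individuals is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Hypothetical heights in inches of seven boys, then of eight.
seven_boys = [54, 51, 58, 49, 56, 53, 60]
eight_boys = seven_boys + [62]

print(median(seven_boys))  # 54, the height of the fourth boy
print(median(eight_boys))  # 55.0, midway between the fourth and fifth
```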

When the items are numerous they are frequently grouped into
classes, as we have seen, such that all in the same class are reckoned
to have some value lying between the extreme limits of that class.
We should then, as before, halve the total number of observations
to fix the particular individual which defines the median organ or
character. This would enable us to pick out the group in which
the median lies, and on reference to the original record of
observations, assuming it was at hand, it would be a simple matter to
identify the median.

If the original record be not available, however, it will be
necessary to proceed to get the best value we can for the median in some
other way. Consider, for example, Table (7), showing the
distribution of marks obtained by 514 candidates in a certain examination.
We begin by rearranging the data in the manner shown below
Table (7). Now in accordance with the definition the median in
marks should, strictly speaking, be midway between the marks
assigned to the 257th candidate and the marks assigned to the
258th candidate: in fact, the marks corresponding to candidate
number 257.5, if it were possible for such a candidate to exist.
But we are ignorant so far as Table (7) goes of the marks gained
by either the 257th or the 258th candidate, though it is possible,
by the simple proportional process known as ‘interpolation,’ to
calculate approximately the marks we require. We think of all
the candidates as forming an ordered sequence, ranged one after
the other according to their marks just like the boys of different
heights, and the table shows that in this mental picture

the 231st candidate gets approximately 30 marks, while
the 318th candidate gets approximately 35 marks.

Hence candidate number 257.5, if one existed, ought to get a
number of marks somewhere between 30 and 35. But, in this
neighbourhood of the sequence,

a difference of (318 - 231) candidates corresponds to a difference
of 5 marks, therefore

a difference of (257.5 - 231) candidates corresponds to a difference
of (5/87 × 26.5) marks.

Thus the marks obtained by candidate number 257.5 are
approximately = 30 + (5/87) × 26.5

= 31.523,

and this may be taken as the median.
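
The interpolation may be sketched as follows. The class frequencies are those of Table (7), read so as to agree with the running totals 5, 14, 42, and so on; the function name and its arguments are illustrative assumptions, and the sketch follows the text in taking candidate number 257.5 as the median individual.

```python
# Table (7): (upper limit of the class in marks, number of candidates).
classes = [(5, 5), (10, 9), (15, 28), (20, 49), (25, 58), (30, 82), (35, 87),
           (40, 79), (45, 50), (50, 37), (55, 21), (60, 6), (65, 3)]

def interpolated_median(classes, width=5):
    """Median of grouped data by simple proportional interpolation."""
    total = sum(freq for _, freq in classes)
    target = total / 2 + 0.5   # candidate number 257.5 when the total is 514
    below = 0                  # candidates up to the previous class limit
    for upper, freq in classes:
        if below + freq >= target:
            lower = upper - width
            return lower + width * (target - below) / freq
        below += freq
    raise ValueError("the table contains no observations")

median_marks = interpolated_median(classes)
print(round(median_marks, 3))  # 31.523
```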

On examining the actual marks-sheet it was found that 252
candidates obtained 31 marks or less, and 273 candidates obtained
32 marks or less, so that the real median was 32, because this was
the number of marks gained by both the 257th and the 258th
candidates. The number 31.523 found above, however, would be
a good approximation to take for the median when all the
information at our disposal was that shown in Table (7).


Table (7). Marks obtained by 514 Candidates in a
certain Examination.

Marks Obtained.   No. of          Marks Obtained.   No. of
                  Candidates.                       Candidates.

 1 to 5               5           36 to 40             79
 6 to 10              9           41 to 45             50
11 to 15             28           46 to 50             37
16 to 20             49           51 to 55             21
21 to 25             58           56 to 60              6
26 to 30             82           61 to 65              3
31 to 35             87
                                  Total               514

The table is to be read as follows:

  5 candidates obtained 1, 2, 3, 4, or 5 marks,
  9 candidates obtained 6, 7, 8, 9, or 10 marks, and so on.

By straightforward addition it can evidently be rearranged so
as to read thus:

  5 candidates obtained not more than  5 marks.
 14     "         "      "    "    "  10   "
 42     "         "      "    "    "  15   "
 91     "         "      "    "    "  20   "
149     "         "      "    "    "  25   "
231     "         "      "    "    "  30   "
318     "         "      "    "    "  35   "
397     "         "      "    "    "  40   "
447     "         "      "    "    "  45   "
484     "         "      "    "    "  50   "
505     "         "      "    "    "  55   "
511     "         "      "    "    "  60   "
514     "         "      "    "    "  65   "

It will be noted that in calculating the median no use is made of
the marks of any of the candidates except those in the two groups
in the immediate neighbourhood of the median, and it is one of
the great advantages of this average that it can be found when an
exact knowledge of the characters of the more extreme individuals
in the series is not in our possession, and even when their measurement
is impossible: it is enough if they can be roughly located.
The arithmetic mean on the other hand is often unduly influenced
by abnormal individuals which are not really typical of the population
in which they appear.
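
As a rough check on the figure 31.523, the interpolation may be sketched in Python. The sketch assumes, as the quoted value suggests, that the median is taken at position (N + 1)/2 = 257.5 and that the (31 to 35) group is treated as beginning at 30; the function name `grouped_median` and both conventions are illustrative assumptions, not a procedure laid down in the text.

```python
# Cumulative reading of Table (7):
# (upper mark of the group, candidates obtaining not more than that mark).
cumulative = [(5, 5), (10, 14), (15, 42), (20, 91), (25, 149), (30, 231),
              (35, 318), (40, 397), (45, 447), (50, 484), (55, 505),
              (60, 511), (65, 514)]


def grouped_median(cumulative, width=5):
    """Interpolate the median inside the group that contains it."""
    total = cumulative[-1][1]
    position = (total + 1) / 2            # assumed convention: 257.5 of 514
    lower_limit, count_below = 0, 0
    for upper, count in cumulative:
        if count >= position:
            in_group = count - count_below
            return lower_limit + width * (position - count_below) / in_group
        lower_limit, count_below = upper, count
    raise ValueError("cumulative table is empty")


median = grouped_median(cumulative)
print(round(median, 3))
```

Only the two cumulative totals on either side of the median (231 and 318) enter the result, which is the point made in the paragraph above.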

Mode. If we measure or observe some organ or character for 
each individual in a given population, the mode, as its name suggests,
is simply the organ or character of most fashionable or most
frequent size. A large draper, for example, will have collars of 
several different shapes and sizes in his shop, but the fashionable 
shape and the predominant size correspond to the mode : it is the 
mode that sells most readily, and the intelligent draper will always 
have it in stock. Again, in Table (2), the disease mode or fashionable
disease among certain school children inspected in Surrey in
1913 was measles, for a greater percentage of children had suffered 
from measles than from any other of the diseases recorded. 

Now when the variable in which we are interested is ‘discrete,’
that is, when it changes by unit steps, leading to classes like
‘tenements with 1 room,’ ‘tenements with 2 rooms,’ ‘tenements with
3 rooms,’ and so on, it is an easy matter to pick out the class of 
greatest frequency : thus, in Table (6) there are more overcrowded 
tenements with 2 rooms than with any other number of rooms 
in the urban districts, so that 2 is the mode so far as this character 
(number of rooms) is concerned, whereas in the rural districts 3 is 
the mode, for there are more overcrowded tenements with 3 rooms 
than with any other number. There may be ambiguity, however, 
in determining the mode in this way for a grouped frequency distribution
when we are dealing with an organ or character subject
to ‘ continuous variation.’ To cover such cases the modal value 
has been defined as that value for which the frequency per unit 
variation of the organ or character is a maximum. The precise 
significance of this wording will only be appreciated after discussing 
frequency curves : at present it must suffice to give a practical 
illustration of how the ambiguity arises and calls for some more 
refined treatment. 

For this purpose turn again to the examination marks in Table (7), 
from which it appears that the mode, if it is to be the marks obtained
by the greatest number of candidates, should lie in the group 
(31 to 35), since there are 87 candidates with marks between these 
limits, and this number exceeds that in any other group. But 
how are we to decide the exact point in the interval (31 to 35) which 
is to correspond to the mode ? Shall it be 33 ? We might say 
‘ yes ’ if the distribution were perfectly symmetrical on either side 
of the (31 to 35) group, but if we examine the neighbouring groups 
we see that the balance leans rather more heavily to the (26 to 30) 
group with a frequency of 82 than to the (36 to 40) group with a 
frequency of 79, and we might allow for this by interpolating in 
some way (ignoring, of course, any errors which may occur in the
frequencies themselves owing to the observations being generally
limited in number). But the pull in the direction of lower marks
becomes still more pronounced to our minds when we contrast
also the frequencies in the next groups on either side, namely
58 and 50. So we might go on until the influence of the whole
field of observations comes into action.

Now it so happened that in this particular case the original 
marks-sheet was to be seen, and a regrouping of the candidates as 
in Table (8) makes it clear that the value found in this way for the 
mode may be artificially displaced sometimes to a serious extent 
by the particular method of grouping adopted. Thus, according 
to this new arrangement, the mode would seem to lie in the interval 
(28 to 32), the mid-value of which differs materially from 33, the 
mid-value of the previous maximum frequency group. 


Table (8). Marks obtained by 514 Candidates in a
Certain Examination (Alternative Grouping).

Marks Obtained.   No. of          Marks Obtained.   No. of
                  Candidates.                       Candidates.

   3 to  7            10             38 to 42           73
   8 to 12            17             43 to 47           46
  13 to 17            36             48 to 52           31
  18 to 22            56             53 to 57           12
  23 to 27            47             58 to 62            3
  28 to 32           108             63 to 67            3
  33 to 37            74
                                     Total             514
[It should be observed that while an alteration of the grouping
may also affect the median, it does not affect it nearly to the same
extent: e.g. the median determined from Table (8) is 31.3, which
differs little from 31.5 the value obtained by the first grouping.]
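
The displacement of the apparent mode by a change of grouping can be seen by picking out the group of greatest frequency under each arrangement. The following Python sketch uses the frequencies as read from Tables (7) and (8), the (1 to 5) and (21 to 25) groups being taken as 5 and 58 so as to agree with the cumulative totals; the helper name `modal_group` is an illustrative assumption.

```python
# Frequencies keyed by (lowest mark, highest mark) of each group.
table_7 = {(1, 5): 5, (6, 10): 9, (11, 15): 28, (16, 20): 49,
           (21, 25): 58, (26, 30): 82, (31, 35): 87, (36, 40): 79,
           (41, 45): 50, (46, 50): 37, (51, 55): 21, (56, 60): 6,
           (61, 65): 3}
table_8 = {(3, 7): 10, (8, 12): 17, (13, 17): 36, (18, 22): 56,
           (23, 27): 47, (28, 32): 108, (33, 37): 74, (38, 42): 73,
           (43, 47): 46, (48, 52): 31, (53, 57): 12, (58, 62): 3,
           (63, 67): 3}


def modal_group(table):
    """Return the group of greatest frequency, its mid-value and frequency."""
    (low, high), frequency = max(table.items(), key=lambda item: item[1])
    return (low, high), (low + high) / 2, frequency


first = modal_group(table_7)
second = modal_group(table_8)
print(first)
print(second)
```

The two mid-values differ by three marks although the candidates are the same, which is the sense in which the grouping displaces the mode.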

If, again, we combine the results of our two groupings to find 
the mode we might be tempted to conclude that it lies somewhere 
between the limits 31 and 32, but on examining the original records 
it was discovered that the real mode was 28. The frequency 
distribution of candidates in this neighbourhood was in fact very 
interesting ; it ran as follows :— 


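    Number of candidates who obtained 25 marks = 14
    Number of candidates who obtained 26 marks = 10
    Number of candidates who obtained 27 marks =  6
    Number of candidates who obtained 28 marks = 33
    Number of candidates who obtained 29 marks = 17
    Number of candidates who obtained 30 marks = 16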

The explanation of this peculiar distribution seemed to be that 
28 marks were required for a candidate to pass, and apparently as 
many candidates as possible were pushed over the pass line: if,
on the first marking, a candidate was found to want only one mark 
to pass, the examiner presumably looked through his paper again 
and did his best to find an answer which by kindly treatment 
might be granted an extra mark. The effect of this leniency was 
ultimately to leave only 6 candidates in the division immediately 
below the pass line, and to swell the number immediately above 
to 33, which thus made 28 easily the ‘ most fashionable ’ mark of 
any, the next largest group of candidates being only 21. It will
be observed that even a candidate who wanted 2 marks to pass
was treated in the same tolerant fashion, although it is not so
easy, of course, for a conscientious examiner to discover two extra
marks as it is to discover one; and if the candidate is 3 marks
below the pass line it is still harder to give him the necessary lift
to carry him over. Thus in the final list we find more candidates
with 26 marks than with 27, and still more with 25 than with 26.
If the above diagnosis is correct, and all marks-sheets tell the same
tale, who shall again say that examiners do not temper justice with
mercy?

This example has illustrated fairly clearly the difficulty of fixing 
the mode with any great precision by mere inspection when the 
individuals are arranged in groups, the value of the variable under 
discussion lying between prescribed limits for each group. While 
it is possible to get a rough approximation to its value in this way,
we conclude that for a really satisfactory determination we require
some method which makes use of the whole distribution, as in the
determination of the mean, and not merely of the portion in the
supposed neighbourhood of the mode. This must be left to a later
chapter; we shall only point out before passing on that there
may sometimes be more than one mode in a given frequency distribution
just as there may be more than one fashionable type of
collar which it is expedient for the draper to stock in large quantities.
The second grouping in the examination example suggests
such a possibility, for it will be noticed that the frequencies of 
candidates do not rise steadily to a single maximum at 108 for 
class (28 to 32), and then fall steadily : there is a previous rise and 
fall in the neighbourhood of class (18 to 22). 

Weighted Mean. Let us suppose a farmer employs for the 
harvest 5 men, 3 women, and 4 boys. In estimating the amount 
of work they can do in a given time it is clear that in general a 
woman or boy cannot be reckoned as equal to a man. He must 
therefore decide what ‘ weight ’ must be given to each in proportion 
to a man. If a woman’s work be taken, for example, to be three- 
quarters as effective and a boy’s work to be half as effective as 
that of a man, we have as the appropriate proportional weights 

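$$1 : \tfrac{3}{4} : \tfrac{1}{2} \quad \text{or} \quad 4 : 3 : 2.$$

Hence 5 men, 3 women, and 4 boys would on the average be equivalent
in output to

$$\left(5 + 3 \times \tfrac{3}{4} + 4 \times \tfrac{1}{2}\right) \text{ men}
= \frac{4 \times 5 + 3 \times 3 + 2 \times 4}{4} \text{ men}
= 9\tfrac{1}{4} \text{ men}.$$

An average of this type is called a weighted mean, 1, 3/4, and
1/2 being the weights, because they tell us what weight to give to
each separate worker in calculating the average.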
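
The same arithmetic may be set out as a short Python sketch, using exact fractions so that the result appears as 37/4, that is 9 1/4. The dictionary layout and variable names are illustrative only.

```python
from fractions import Fraction

# (number of workers, weight relative to one man), as in the farmer example.
workers = {
    "men": (5, Fraction(1)),
    "women": (3, Fraction(3, 4)),
    "boys": (4, Fraction(1, 2)),
}

# Each worker is counted in proportion to the weight given to his class.
equivalent_men = sum(count * weight for count, weight in workers.values())
print(equivalent_men)         # 37/4
print(float(equivalent_men))  # 9.25
```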
Let us consider the effect such weighting has in general upon a 
mean, and for this purpose we shall test it on a set of index numbers 
measuring rents in certain groups of towns in 1912, as given in a 
Report on the Cost of Living of the Working Classes issued by the
Board of Trade (Cd. 6955). 





Table (9). Mean Index Numbers of Rents for certain
Geographical Groups of Towns in 1912 (with reference
to Middle Zone of London as standard = 100).

          (1)                      (2)      (3)        (4)        (5)         (6)
                                           No. of     Each                  Approximate
                                           Towns      Group                 sub-multiples
  Geographical Group.            Rents.    included   counting   Arbitrary  of Nos. in
                                           in the     as 1.      Weights.   previous
                                           Group.                           column.

  Northern Counties and
    Cleveland                     66.0        9          1          27          3
  Yorkshire (except Cleveland)    58.5       10          1          54          6
  Lancashire and Cheshire         56.9       17          1          45          5
  Midlands                        52.3       14          1         125         14
  Eastern and East Midland Cos.   53.4        7          1          63          7
  Southern Counties               63.7                   1          14          2
  Wales and Monmouth              64.8        4          1          22          2
  Scotland                        62.0                   1         178         20
  Ireland                         51.7        6          1          55          6

  Average                          . .      58.4       58.8        57.6        57.6


The first mean in the above table, 58.4, is obtained by multiplying
(or weighting) the mean rent of each geographical group by the
number of towns in the group, given in col. (3), adding the numbers
so obtained, and dividing the total by the total number of towns,
thus:

$$\frac{9(66.0) + 10(58.5) + \dots + 6(51.7)}{9 + 10 + \dots + 6}.$$

This is simply the arithmetic mean treating each town as unit.

The second mean, 58.8, is obtained by adding the mean rents of
all the groups and dividing by the total number of groups, thus:

$$\frac{66.0 + 58.5 + \dots + 51.7}{1 + 1 + \dots + 1}.$$

This is the arithmetic mean treating each geographical group as
unit.

The third mean, 57.6, is obtained by multiplying, or weighting,
the mean rent of each group by a perfectly arbitrary number given
in col. (5); the numbers selected were taken quite at random from
another column of figures in another Blue-book, and had no connection
whatever with the subject of rents; this gives:

$$\frac{27(66.0) + 54(58.5) + \dots + 55(51.7)}{27 + 54 + \dots + 55}.$$

The last mean, 57.6, is obtained by choosing as weights any
numbers (and for simplicity we choose the smallest) as in col. (6)
which are very roughly proportional to the arbitrary weights used
in the last instance; we thus get:

$$\frac{3(66.0) + 6(58.5) + \dots + 6(51.7)}{3 + 6 + \dots + 6}.$$

Now the first of these means is clearly the most satisfactory, since 
it is the result of very properly weighting the mean rent of each 
group of towns according to the number of towns the group contains.
But the second result shows that if we are ignorant of the
number of the towns in each group we shall not be very far out in 
our calculation if we treat them all as of equal importance, and find 
the simple arithmetic mean of the mean rents in the nine groups. 
We can even go further, for we find, from the third and fourth results, 
that by weighting the mean rents in the various groups on quite a 
random basis, the mean we get still does not differ very greatly from 
the best value first found. 

The important principle of which the above example is an illustration
is perfectly general, and may be stated as follows: If the
total number of measurements or observations be not very small,
and if the resulting values of the organ or character measured
(rent in our case) be not very unequal, any reasonable selection of
multipliers or weights (as, for instance, the first two adopted above)
will give means which differ from one another by but little; and
even an apparently unreasonable selection of multipliers (as, for
instance, the third adopted above), assuming they are not so
badly chosen as to give any particular group a very unfair weight
in comparison with the others, will not throw the mean out badly.
Further, in place of a set of large multipliers we may substitute
small numbers which are roughly proportional to them (as we have
done in the fourth case above), and the mean will again be very
little affected. [See Part II, p. 203.]
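
The principle may be tried on the rent figures themselves. The Python sketch below takes the nine mean rents and the weights of cols. (4), (5) and (6) as read from Table (9); col. (3) is left out because the sketch uses only columns for which every group's entry is available here. The function name `weighted_mean` is an illustrative assumption.

```python
# Mean rent index of each geographical group, col. (2) of Table (9).
rents = [66.0, 58.5, 56.9, 52.3, 53.4, 63.7, 64.8, 62.0, 51.7]

equal_weights = [1] * len(rents)                              # col. (4)
arbitrary_weights = [27, 54, 45, 125, 63, 14, 22, 178, 55]    # col. (5)
sub_multiples = [3, 6, 5, 14, 7, 2, 2, 20, 6]                 # col. (6)


def weighted_mean(values, weights):
    """Sum of (weight x value) divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


mean_equal = weighted_mean(rents, equal_weights)
mean_arbitrary = weighted_mean(rents, arbitrary_weights)
mean_sub_multiples = weighted_mean(rents, sub_multiples)

for label, value in (("each group as 1", mean_equal),
                     ("arbitrary weights", mean_arbitrary),
                     ("sub-multiples", mean_sub_multiples)):
    print(f"{label}: {value:.1f}")
```

The three results lie within about one and a quarter points of one another, and the replacement of the arbitrary weights by their rough sub-multiples alters the first decimal place not at all.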



CHAPTER V 

AVERAGES {continued) 

Applications of Weighted Mean. In determining the weighted mean
of a set of observations it is usual, of course, to weight each observation
according to its importance, though what number should be
chosen as a measure of its importance may sometimes be a matter
of doubt. It is not a very difficult matter to decide when we
wish, for example, to compare birth, marriage, or death rates in
two districts, if we know how the constitution of the population
in the one district differs from that in the other, for the weighting
in each of these cases must be in proportion to the population
concerned, and it is too important to ignore.

Death rate, crude and corrected. Imagine a city in which the
total number of deaths in a certain year is N out of a population
numbering P.

The ordinary or crude death rate for that city will then be

$$\frac{N}{P} \times 1000,$$

by definition.

Now this number N may be analysed according to the ages of
the people who have died; let us suppose it is made up of

    $n_1$ people between limits 0 and less than 5 years of age,
    $n_2$ people between limits 5 and less than 15 years of age,
    $n_3$ people between limits 15 and less than 25 years of age,

and so on, where

$$n_1 + n_2 + n_3 + \dots = N.$$

Again the number P may be analysed according to the ages of
the people who compose the total population, giving, say,

    $p_1$ of the population between limits 0 and less than 5 years of age,
    $p_2$ of the population between limits 5 and less than 15 years of age,
    $p_3$ of the population between limits 15 and less than 25 years of age,

and so on, where

$$p_1 + p_2 + p_3 + \dots = P.$$

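Thus we may write for the crude death rate

$$\begin{aligned}
D &= \frac{N}{P} \times 1000 \\
  &= \frac{n_1 + n_2 + n_3 + \dots}{P} \times 1000 \\
  &= \frac{p_1}{P} \cdot \frac{n_1}{p_1} 1000 + \frac{p_2}{P} \cdot \frac{n_2}{p_2} 1000 + \frac{p_3}{P} \cdot \frac{n_3}{p_3} 1000 + \dots \\
  &= (p_1 d_1 + p_2 d_2 + p_3 d_3 + \dots)/P,
\end{aligned}$$

where $d_1$ is the death rate between limits 0 and less than 5 years of age,
      $d_2$ is the death rate between limits 5 and less than 15 years of age,
      $d_3$ is the death rate between limits 15 and less than 25 years of age,

and so on.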

Now if we compare this expression with the corresponding one for
another city, say,

$$D' = (p_1' d_1' + p_2' d_2' + p_3' d_3' + \dots)/P',$$

it is quite conceivable that the death rates in the various age groups
might be equal,

$$d_1 = d_1', \quad d_2 = d_2', \quad d_3 = d_3', \dots$$

and yet D might exceed D' because in the first city there are a
greater proportion of infants or old people, on which classes the
hand of death falls heaviest, that is, because the p’s or weights
which multiply the biggest d’s are greater in the first case than in
the second. But so long as the d’s in the two cities are equal, age
group by age group, it would be reasonable to regard the cities as
equally healthy, or unhealthy as the case might be, and therefore
to insure a fair comparison it is usual in the Reports of the
Registrar-General to give a corrected death rate in place of the crude
death rate defined above.

This is done by weighting the death rate for each age group, not
in proportion to the actual number of persons in that group in
the city itself, but in proportion to the corresponding number in
the country at large. Thus, if we denote the proportion of the
population, Q,

    between limits 0 and less than 5 in the country at large by $q_1/Q$,
    between limits 5 and less than 15 in the country at large by $q_2/Q$,
    between limits 15 and less than 25 in the country at large by $q_3/Q$,

and so on, we get as the corrected death rate

$$(q_1 d_1 + q_2 d_2 + q_3 d_3 + \dots)/Q,$$

a form which has the effect of making the results agree in two
cities which have equal d’s throughout.
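
The effect of the correction can be illustrated with a small Python sketch. All the figures below are hypothetical, invented purely for illustration: two imaginary cities are given the same death rates age group by age group but a different age constitution, and a made-up standard population stands for the country at large. The function names are likewise assumptions of the sketch.

```python
def crude_rate(deaths, population):
    """D = 1000 N / P."""
    return 1000 * sum(deaths) / sum(population)


def age_group_rates(deaths, population):
    """d_k = 1000 n_k / p_k for each age group."""
    return [1000 * n / p for n, p in zip(deaths, population)]


def corrected_rate(rates, standard_population):
    """(q_1 d_1 + q_2 d_2 + ...) / Q, weighting by the country at large."""
    weighted = sum(q * d for q, d in zip(standard_population, rates))
    return weighted / sum(standard_population)


# Hypothetical data for three age groups (not taken from any report).
city_a_population = [3000, 5000, 2000]
city_a_deaths = [120, 25, 40]
city_b_population = [1000, 8000, 1000]
city_b_deaths = [40, 40, 20]
country_population = [2000, 6000, 2000]

crude_a = crude_rate(city_a_deaths, city_a_population)
crude_b = crude_rate(city_b_deaths, city_b_population)
corrected_a = corrected_rate(
    age_group_rates(city_a_deaths, city_a_population), country_population)
corrected_b = corrected_rate(
    age_group_rates(city_b_deaths, city_b_population), country_population)

print(crude_a, crude_b)          # the crude rates differ
print(corrected_a, corrected_b)  # the corrected rates agree
```

With these invented figures the crude rates are 18.5 and 10.0 per thousand, yet both corrected rates come to 15.0, since the d’s are equal throughout.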

A similar method of correction is clearly applicable in considering
the incidence of the death rate when we are concerned not
with a difference of district but with a difference of sex, occupation,
religious profession, wage-earning capacity, or any other
defined character. Further, it may be used also in comparing birth
rates, marriage rates, heights, weights, chest measurements, or any
similar attributes, when it is necessary to refer the observations
or measurements to a standard population in order to avoid
complications due to age variation.

There is another method of correction, equally general in application,
which is useful when the death rates in the various age groups
are not known. In this case D, the crude death rate for the whole
population of the district is known, also $p_1/P$, $p_2/P$, $p_3/P$ ..., the
proportions of the population between the various age limits, but
$d_1$, $d_2$, $d_3$ ... are supposed unknown.

Now if the population in the country as a whole were the same in
corresponding age groups as it is in the district under consideration,
we should get as the death rate for the whole country

$$(p_1 \delta_1 + p_2 \delta_2 + p_3 \delta_3 + \dots)/P,$$

where $\delta_1$, $\delta_2$, $\delta_3$ ... are the death rates in the various age groups in
the country at large, and these would in practice as a rule be known.
The actual death rate for the whole country is, however,

$$(q_1 \delta_1 + q_2 \delta_2 + q_3 \delta_3 + \dots)/Q,$$

where $q_1/Q$, $q_2/Q$, $q_3/Q$ ... denote, as before, the real proportions
of the population in the various age groups in the country at large.
We take as the corrected death rate required for the district a
number bearing to the crude death rate the same ratio as

$(q_1 \delta_1 + q_2 \delta_2 + \dots)/Q$ bears to $(p_1 \delta_1 + p_2 \delta_2 + \dots)/P$.

Hence we have

$$\text{corrected death rate} = D \times \frac{q_1 \delta_1 + q_2 \delta_2 + \dots}{p_1 \delta_1 + p_2 \delta_2 + \dots} \times \frac{P}{Q}.$$
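
This second method may be sketched in the same way. Once more every figure is hypothetical and chosen only to make the arithmetic easy to follow; the names `district_population`, `country_rates` and so on are assumptions of the sketch. Note that the district's own death rates by age are nowhere required.

```python
def rate_for_age_make_up(population, rates):
    """(sum of p_k * delta_k) / (sum of p_k): the country's age-group
    death rates applied to a given age constitution."""
    return sum(p * r for p, r in zip(population, rates)) / sum(population)


# Hypothetical data for three age groups (not taken from any report).
district_population = [3000, 5000, 2000]   # p_1, p_2, p_3
district_crude_rate = 18.5                 # D, per thousand
country_population = [2000, 6000, 2000]    # q_1, q_2, q_3
country_rates = [30.0, 4.0, 25.0]          # delta_1, delta_2, delta_3

# Death rate the country would show with the district's age make-up.
expected = rate_for_age_make_up(district_population, country_rates)
# Actual death rate of the country at large.
actual = rate_for_age_make_up(country_population, country_rates)

# The corrected rate bears to D the same ratio as `actual` bears to `expected`.
corrected = district_crude_rate * actual / expected
print(expected, actual, corrected)
```

Here the district's age constitution would by itself raise the country's rate from 13.4 to 16.0, so the crude rate of 18.5 is scaled down in the ratio 13.4 : 16.0.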

Index Numbers to compare Household Budgets. Another highly
important illustration of a weighted mean occurs in the search for a 
satisfactory measure of the change in the cost of living from year 
to year. We have already introduced the subject of variation in 
wholesale prices, and we have seen that Sauerbeck, in forming his 
index numbers, treats as one each of the forty-five commodities 
he uses to measure this variation: the observations, that is to 
say, are not weighted. 

But, confining our attention to food alone, supposing we have 
five items, such as bacon, bread, tea, sugar, milk, for which the 
index numbers of prices at two different dates are :— 



               Bacon.   Bread.   Tea.   Sugar.   Milk.

First date       100      100    100     100      100
Second date      117       95     94     102      109


Is it really right to treat each of these items as of equal importance 
with the rest, or ought we to regard bread and tea, say, as of more 
weight than bacon, and count bread perhaps five times and tea 
three times while counting bacon only once ? It is clear that, in 
order to select a reasonable set of multipliers in this case, we should 
need to know the standard of living of the class of people under 
consideration, and how much in the aggregate they spend upon 
bacon and how much upon bread, etc. 

A partial answer to these questions can be obtained by making 
a collection of household budgets as was done, for example, by two 
Government Committees which recently reported (1918-19) on the 
Cost of Living among the Urban and the Agricultural Working Classes 
respectively. If the number of commodities employed is large, 
even an arbitrary set of multipliers, as we have indicated, will not 
displace the mean any great distance from the value when reasonable
weights are chosen, but unfortunately in collecting such household
budgets we are confined to the comparatively limited variety
of food-stuffs which are in general use. 

Different principles may be followed in making the comparison
between one year and another which may be illustrated by a few
figures from the Urban Classes Report (1918):

Table (10). Household Budgets showing Prices of each Commodity
and Quantities Purchased at Two Different Dates by Typical Family.

                   First year (1914).            Second year (1918).
Commodity.      Price (pence   No. of lb.     Price (pence   No. of lb.
                per lb.).      bought.        per lb.).      bought.

Sugar               2.2           5.9             7.07          2.83
Tea                21.3           0.68           33.3           0.57
Potatoes            0.7          15.6             1.26         20.0
 . .                . .           . .             . .           . .


Let $x_1$ be the price, in pence per unit, of any one commodity
at the first date, and let $n_1$ be the number of units of this commodity
bought per week by a typical family ($n_1$ may be estimated in different
ways, e.g. (1) by dividing the total number of units bought by
all families by the total number of those families, or (2) by ranging
the different amounts bought by different families in order of
magnitude and picking out the median amount, or (3) by choosing
the mode, i.e. the amount most commonly purchased). Also let $x_2$
be the price, in pence per unit, of the same commodity at the second
date, and let $n_2$ be the number of units of the commodity then
bought per week by the typical family estimated in the same way
as before.

The actual expenditure, measured in pence, at the two dates
will then be

$S(x_1 n_1)$ and $S(x_2 n_2)$

respectively, where $S(x_1 n_1)$ simply denotes the sum of expressions
like $(x_1 n_1)$ for all the commodities recorded and $S(x_2 n_2)$ denotes the
sum of expressions like $(x_2 n_2)$ for all the commodities recorded,
S, the old English S, being a well-known conventional abbreviation
for ‘Sum of expressions like.’ Thus, with the numbers in Table (10),
we should have

$$S(x_1 n_1) = (2.2)(5.9) + (21.3)(0.68) + (0.7)(15.6) + \dots$$
$$S(x_2 n_2) = (7.07)(2.83) + (33.3)(0.57) + (1.26)(20.0) + \dots$$
Taking 100 as the index number to represent expenditure at the
first date, the index number measuring expenditure at the second
date may be formed in any of the following different ways,* which
as a rule, of course, lead to different results:

(1) $100\,S(x_2 n_2)/S(x_1 n_1)$;

(2) $100\,S(x_2 n_1)/S(x_1 n_1)$ or $100\,S(x_2 n_2)/S(x_1 n_2)$;

(3) $100\,S(x_1 n_2)/S(x_1 n_1)$ or $100\,S(x_2 n_2)/S(x_2 n_1)$.

The first of these expressions compares the actual expenditure at 
the second date to that at the first date. 

The next two expressions take into account directly only the 
change in prices; they compare, not actual expenditures but, the 
expenditures at the two dates as they would be if the amounts 
purchased at the two dates were the same: the first supposing 
these amounts to equal those actually bought at the first date, 
and the second supposing them to equal those actually bought 
at the second date. 

The last two expressions, on the other hand, take into account 
directly only the change in amounts purchased; they compare 
the expenditures at the two dates as they would be if the prices 
ruling at the two dates were the same : the first supposing these 
prices to equal those actually charged at the first date, and the 
second supposing them to equal those actually charged at the 
second date. 
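
The five forms may be compared on the three commodities shown in Table (10). The following Python sketch takes the prices and quantities as read from that table; since only sugar, tea and potatoes are included, the results are a partial illustration and are not the index numbers of the Report. The names `expenditure`, `p1`, `q1` and so on are assumptions of the sketch, `p` standing for the prices x and `q` for the quantities n.

```python
# Sugar, tea, potatoes: prices in pence per lb. and lb. bought per week.
p1 = [2.2, 21.3, 0.7]      # x_1, prices in 1914
q1 = [5.9, 0.68, 15.6]     # n_1, quantities in 1914
p2 = [7.07, 33.3, 1.26]    # x_2, prices in 1918
q2 = [2.83, 0.57, 20.0]    # n_2, quantities in 1918


def expenditure(prices, quantities):
    """S(x n): the sum of price times quantity over all commodities."""
    return sum(x * n for x, n in zip(prices, quantities))


actual = 100 * expenditure(p2, q2) / expenditure(p1, q1)                # form (1)
price_first_qty = 100 * expenditure(p2, q1) / expenditure(p1, q1)       # form (2), first
price_second_qty = 100 * expenditure(p2, q2) / expenditure(p1, q2)      # form (2), second
amount_first_price = 100 * expenditure(p1, q2) / expenditure(p1, q1)    # form (3), first
amount_second_price = 100 * expenditure(p2, q2) / expenditure(p2, q1)   # form (3), second

for label, value in (("(1) actual expenditure", actual),
                     ("(2) prices, first-date amounts", price_first_qty),
                     ("(2) prices, second-date amounts", price_second_qty),
                     ("(3) amounts, first-date prices", amount_first_price),
                     ("(3) amounts, second-date prices", amount_second_price)):
    print(f"{label}: {value:.1f}")
```

It follows from the definitions that form (1) is the product of the first form of (2) and the second form of (3), divided by 100, so that the change in actual expenditure separates into a change of prices and a change of amounts.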

The particular method of weighting adopted must naturally 
depend upon the circumstances of the period under discussion 
and the nature of the inquiry one is making ; it is a nice question 
to decide how far emphasis should be laid upon the old standard 
of life (measured by food, lighting, rent, recreation, etc.) with the 
expense required to maintain it, and upon the new standard of life 
and the cost necessary to reach it. 

It may be useful here to summarize a few of the questions of 
interest which present themselves in connection with the formation 
of index numbers of prices designed to measure changes in the 
value of money in general without reference to any particular class 
of the community :— 

1. What years should be selected in fixing our standard prices?

2. What commodities should be chosen as a basis for our
average?

3. What weight should be given to each commodity in relation
to the rest?

4. How should the prices of the several commodities be determined,
bearing in mind that ‘price’ itself frequently varies from
place to place?

5. Finally, how should these prices be combined to give the
average required? Should we use the simple arithmetic mean, the
geometric mean familiar to students of Algebra, the median, or
some other measure?

* [See also The Measurement of Changes in the Cost of Living, by A. L. Bowley, Sc.D.,
in the Journal of the Royal Statistical Society, May 1919, for a more complete
discussion of the subject.]

While we are not prepared to attempt to answer these questions 
fully, seeing that authorities are not altogether agreed as to what 
the answers should be, one or two points may be worth noting. 
Generally speaking we may say that:— 

1. The years selected in fixing our standard prices should be 
years in which economic conditions were normal rather than 
abnormal. 

2. The commodities chosen should be articles of general consumption,
and as wide a field as possible should be covered in their
choice.

3. Many consider that little is gained by weighting, but, if
weights are introduced, the greater the importance of any commodity
in relation to the rest, judged for example by the relative
quantity consumed, the greater should be the weight assigned
to it.

4. The practical difficulty of assessing retail prices when they
are uncontrolled compels us in general to fall back upon wholesale
quotations, on which some light may be thrown by keeping
under observation the important markets for the sale of each
commodity.

5. The average commonly used is the simple arithmetic or the
weighted mean, though arguments can be adduced in favour of
other averages such as the median.

Leaving index numbers now on one side and returning to the 
general subject of averages, we may remark that the question 
which average is correct in any given case, the mean (weighted or 
otherwise), the median, or the mode, does not arise : no one average 
is more correct than another, because they are all entirely conventional
and represent different ideas; they correspond in fact
to so many different ways of summing up a set of observations or
measurements in a single numerical statement, and the real question
to determine is which statement, which kind of average, brings the
set of observations before us to the best focus.

For this purpose one average will clearly be best in one case and
another in another, but it may be stated without hesitation that
the arithmetic mean is certainly the most useful of the three and
it is the most frequently used. Other averages, such as the
geometric and the harmonic means [See Part II, p. 204], are suitable
in special classes of problems. The geometric mean is of
particular interest in the construction of price index-numbers.*

In a reasonably symmetrical distribution of observations, one in 
which the variables of medium size are the most frequent and the 
frequency diminishes about equally on either side towards the 
largest and the least of the variables, the values of the mean, the 
median, and the mode will be found to lie all very close together; 
and a useful practical rule to remember is that the median comes 
in general between the mean and the mode, the difference between the 
mean and the mode being about three times the difference between the 
mean and the median. This rule, for lack of a better, might be used 
to determine the mode in suitable cases, or it might be used to test 
the value found in some other way. 

The general term ‘average’ is frequently used when the particular
denomination ‘arithmetic mean’ is implied, but the context
will usually prevent misunderstanding. 

In order to get a clear impression of the outstanding features 
presented by the three chief averages discussed, let us go over them 
once more in the case of marks awarded to a number of students 
in a class. All three may be regarded as in a sense measures of 
the standard reached by the class as a whole in the examination, 
but the measures are made in different ways:

1. The Arithmetic Mean is found by merely dividing the aggregate 
marks of the class by the number of the students, and it gives the 
marks earned by each student if we conceive them all to be of 
equal merit. 

2. The Median is found by ranging the students in order of merit 
from top to bottom, and picking out the marks awarded to the one 
who comes half-way down the list.

3. The Mode is the most fashionable number of marks, i.e. the 
marks obtained by the greatest number of candidates. 

The advantages and disadvantages of the three types may be 
set out broadly as follows, although the boundary lines must not 
be too strictly drawn :— 

* See Note at end of chapter.
Mean.                         Median.                       Mode.

Easy to calculate when the    Easy to pick out when the     Not easy to determine with
values of the variable can    individuals can be ranged     precision, when the
be summed and their number    in order according to the     observations fall into
is known.                     value or degree of the        groups of different ranges,
                              variable observed.            without fitting a frequency
                                                            curve to the distribution
                                                            as a whole.

Well designed for             Unsuited for algebraical      Unsuited for algebraical
algebraical manipulation,     work.                         work.
as, for example, when we
wish to combine different
sets of observations [see
Part II, p. 203, Note 4,
for two illustrations].

Affected sometimes too much   Determined merely by its      Unaffected by abnormal
by abnormal individuals       position in the               individuals, and owes its
among the observations.       distribution, and its         importance to the fact that
                              actual value is thus quite    it is located in the region
                              unaffected by abnormal        where the frequency is
                              individuals.                  most dense.



The reader should test his grasp of the principles so far introduced
by applying them himself to a concrete case. For example,
he might use the data in Table (11), with regard to wages earned
by certain women, taken from Tawney’s Minimum Wages in the
Tailoring Trade, and based upon the 1906 Wages Census. Let him
begin by roughly estimating the mean, the median, and the mode
from an inspection of the distribution. He might then proceed
to calculate the mean wage:

(1) taking the actual frequencies given in the table;

(2) taking simple sub-multiples of these frequencies, roughly
one-hundredth part of each: 2, 4, 6, 7, 9, 11, etc.;

(3) assuming unit frequency in place of that given in the table for
each wage group.

Finally, he might determine the median and the mode in the
manner explained in the text, deducing the latter from the relation
(mean − mode) = 3 (mean − median).

The results obtained should be

(1) 13.08s.; (2) 13.10s.; (3) 15.59s.
Median = 12.53s.; Mode = 11.43s.
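
The last step, deducing the mode from the mean and the median, is a one-line calculation. The Python sketch below applies the relation (mean − mode) = 3 (mean − median) to the first mean and the median quoted above; the function name `mode_from_mean_and_median` is an illustrative assumption, and the rule itself is only the rough practical rule given in the text, suitable for reasonably symmetrical distributions.

```python
def mode_from_mean_and_median(mean, median):
    """Rearrangement of (mean - mode) = 3 (mean - median)."""
    return mean - 3 * (mean - median)


mean_wage = 13.08     # shillings, result (1) above
median_wage = 12.53   # shillings

mode_wage = mode_from_mean_and_median(mean_wage, median_wage)
print(round(mode_wage, 2))  # 11.43
```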


Table (11). Distribution of Wages of Certain
Women Tailors.

          (1)                  (2)                   (3)                  (4)
                           No. of Women                               No. of Women
  Wages between limits     earning wages     Wages between limits     earning wages
                           as shown in                                as shown in
                           Column (1).                                Column (3).

   5s. and less than  6s.       150          16s. and less than 17s.       642
   6s. and less than  7s.       384          17s. and less than 18s.       453
   7s. and less than  8s.       653          18s. and less than 19s.       401
   8s. and less than  9s.       690          19s. and less than 20s.       272
   9s. and less than 10s.       900          20s. and less than 21s.       251
  10s. and less than 11s.      1146          21s. and less than 22s.       138
  11s. and less than 12s.      1201          22s. and less than 23s.       124
  12s. and less than 13s.      1138          23s. and less than 24s.        64
  13s. and less than 14s.       930          24s. and less than 25s.        54
  14s. and less than 15s.       886          25s. and less than 30s.       122
  15s. and less than 16s.       790


* [The most important example of the use of the geometric mean in this 
connection is in the construction of the Board of Trade Index Number of 
Wholesale Prices: see Supplement to Board of Trade Journal, Jan. 24th, 1935; 
also an article in the Journal of the Royal Statistical Society, March 1921.] 











CHAPTER VI 


DISPERSION OR VARIABILITY 

Let us suppose that two men set out separately on walking tours 
and that they walk as follows :— 



                    First Man       Second Man 
                      walks           walks 

  First day   .     20 miles.       15 miles. 
  Second day  .     20   „          20   „ 
  Third day   .     25   „          25   „ 
  Fourth day  .     25   „          25   „ 
  Fifth day   .     30   „          30   „ 
  Sixth day   .     30   „          35   „ 

  6 days           150 miles.      150 miles. 


The total distance covered in six days, namely 150 miles, and 
therefore also the mean rate of walking, 25 miles a day, are thus 
exactly the same in both cases, but the dispersion of the values of 
the variable (the variable being in this instance the number of 
miles walked per day) round about their mean value, the variability, 
is different in the two cases. The greatest deviation from the 
average in the first case is five and in the second case it is ten miles. 
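
The comparison of the two walkers may be checked with a few lines of 
Python. This is only an illustrative sketch: the daily mileages are those 
of the table above, while the function name greatest_deviation is 
introduced here for convenience and is not part of the text. 

```python
from statistics import mean

# Daily mileages of the two walkers, as tabulated above.
first_man = [20, 20, 25, 25, 30, 30]
second_man = [15, 20, 25, 25, 30, 35]


def greatest_deviation(values):
    """Largest signless distance of any value from the arithmetic mean."""
    centre = mean(values)
    return max(abs(v - centre) for v in values)


for name, miles in (("first man", first_man), ("second man", second_man)):
    # Same total and same mean, but a different spread about the mean.
    print(name, sum(miles), mean(miles), greatest_deviation(miles))
```

Both walkers show the same total and the same mean, so that only the 
last figure printed distinguishes them. 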

Thus, besides knowing the average of a set of values of a variable 
it is important to measure the dispersion of the distribution. Are 
the observations crowded in a dense mass around the average, 
or do they tail off above and below it, and to what extent ? 
In other words, what is the variability from the average of the 
distribution? 

Mean Deviation. Now we are not concerned here with the signs 
of the separate deviations, with the question, that is, whether any 
particular value of the variable lies above or below the average: 
it is only of their amount we wish to take cognizance. Perhaps 
the most obvious way to measure the total variability, and at the 
same time to ignore the signs of the separate deviations from the 
average, is to add up these deviations, treating them all as signless, 
and to divide the result by their total number. This gives what 
is known as the mean deviation of the system of observations: it 
is the ordinary arithmetic mean of the separate deviations, treated 
as if they were all in the same direction. In measuring them we 
may use either the mean or the median as the average, but it 
would seem preferable to take the latter, because the mean deviation 
is least when the median is chosen as the origin, or zero point, from 
which the differences are measured. The proof of this fact will 
be found in Part II, p. 270, Note G, but we may readily test it in 
a given case. 

Let us adapt the ‘ walking ’ illustration used above, slightly 
extending the figures and making them unsymmetrical, i.e. of 
unequal variability on either side of the average, so as to prevent 
the median coinciding with the mean. We then have an amended 
table setting out the number of miles walked by a certain man on 
successive days during, say, a fortnight’s tour, as follows: 

Table (12). Number of Miles walked on Successive Days. 

  (1)      (2)       (3)        (4)         (5)        (6)         (7)           (8) 
   f                  x                                             fx 
 No. of   Miles   Deviation  Deviation   Deviation  Deviation   [No. in       [No. in 
 days.   walked.  from 25.   from 24.64. from 24.   from 26.    Col. (1)] ×   Col. (1)] × 
                                                                [No. in       [No. in 
                                                                Col. (3)].    Col. (4)]. 

   1       10        15       14.64        14         16          15           14.64 
   2       15        10        9.64         9         11          20           19.28 
   3       20         5        4.64         4          6          15           13.92 
   3       25         0        0.36         1          1           0            1.08 
   2       30         5        5.36         6          4          10           10.72 
   2       35        10       10.36        11          9          20           20.72 
   1       40        15       15.36        16         14          15           15.36 

  14       ..        ..        ..          ..         ..          95           95.72 


The first two columns show that 10 miles was the distance walked 
on the first day, 15 miles on each of the next two days, 20 miles 
on each of the next three days, and so on until the last day, when 
40 miles was the distance walked. 

The median in this case, being the number of miles walked on 
the middle day when the days are ranged in order of mileage from 
the least to the greatest, is 25, for this is the distance covered on 
both the seventh and the eighth days, which come half-way along 
the series. 

Col. (3) shows the deviations from the median, 25, of the distances 
covered each day as recorded in col. (2), and col. (7) enables us to 
sum these deviations when each is multiplied by the number of 
days to which it corresponds, since these numbers, given in col. (1), 
show how many times each deviation is repeated. Hence the mean 
deviation, regardless of sign, measured from the median 

    = [(1×15)+(2×10)+(3×5)+(2×5)+(2×10)+(1×15)]/14 

    = (15+20+15+10+20+15)/14 

    = 95/14 

    = 6.79 miles. 


We may compare this with the corresponding deviations measured 
from (1) the arithmetic mean, (2) the number 24, and (3) the 
number 26 as origin respectively. 


1. The arithmetic mean of the distribution is obtained at once 
by multiplying the corresponding numbers in cols. (1) and (2), 
adding the results, and dividing the total by 14, thus 

    Arithmetic mean 

    = [1(10)+2(15)+3(20)+3(25)+2(30)+2(35)+1(40)] / (1+2+3+3+2+2+1) 

    = (10+30+60+75+60+70+40)/14 

    = 345/14 

    = 24.64 miles, 

and the deviations from 24.64 are shown in col. (4); the mean 
deviation from 24.64, obtained by combining cols. (1) and (4) and 
adding as shown in col. (8), 

    = [1(14.64)+2(9.64)+ . . . ]/14 

    = 95.72/14 

    = 6.84 miles. 


2. Similarly, the mean deviation from 24, making use of col. (5), 

    = [1(14)+2(9)+ . . . ]/14 
    = 6.93 miles. 

3. And the mean deviation from 26, making use of col. (6), 

    = [1(16)+2(11)+ . . . ]/14 
    = 7.07 miles. 

The original determination gives a value which is less than any 
of these three results, as was anticipated. 
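
The four mean deviations just obtained can be verified mechanically. The 
following Python sketch is offered only as a check on the arithmetic; the 
mileages and day-counts are those of Table (12), and the helper name 
mean_deviation is an assumption of this illustration. 

```python
# Miles walked and the number of days on which each distance occurred.
miles = [10, 15, 20, 25, 30, 35, 40]
days = [1, 2, 3, 3, 2, 2, 1]


def mean_deviation(values, freqs, origin):
    """Frequency-weighted mean of the signless deviations from `origin`."""
    n = sum(freqs)
    return sum(f * abs(v - origin) for v, f in zip(values, freqs)) / n


# Arithmetic mean of the distribution (345/14, about 24.64).
arithmetic_mean = sum(f * v for v, f in zip(miles, days)) / sum(days)

for origin in (25, arithmetic_mean, 24, 26):
    print(round(origin, 2), round(mean_deviation(miles, days, origin), 2))
```

The smallest of the four values printed is that measured from 25, the 
median. 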

The mean deviation from the median is, however, difficult to 
calculate with exactness when the observations are recorded in 
groups between different limits: for this and other reasons we 
shall not spend much time upon it, and we shall as a rule choose 
the mean as origin of reference rather than the median. It 
may be as well to explain the source of the difficulty by a small 
hypothetical illustration. 

Let us suppose that in making measurements of some organ or 
character in 13 individuals we get a result lying between 4 and 6 
units on six occasions, between 6 and 8 units on four occasions, and 
between 8 and 10 units on three occasions. Here, assuming that all 
the individuals in any group have the mid-value measurement for 
that group, i.e. treating the distribution as one of 6 individuals 
with a variable measuring 5 units, 4 individuals with a variable 
measuring 7 units, and 3 individuals with a variable measuring 
9 units, we get 18/13 as the mean deviation with 7 as origin and 
18.5/13 for the mean deviation with 6.5 as origin, as the following 
table shows: 


    Measurement.           f           x           y         fx      fy 
                       Frequency.  Deviation   Deviation 
                                   from 7.     from 6.5. 

  4 and less than  6       6           2          1.5        12       9 
  6  „    „    „   8       4           0          0.5         0       2 
  8  „    „    „  10       3           2          2.5         6       7.5 

                          13          ..          ..         18      18.5 


Now the result obtained is in agreement with the minimum 
mean deviation theory, granted that 7 is the median measurement, 
as it might certainly be. But it is not so of necessity, and in that 
case the assumption that every individual has the mid-value of its 
group might lead, in the above calculation, to appreciable 
inaccuracy unless the number of observations is large and the 
class-interval is small. For example, the actual distribution 
might, without contradicting the previous data, conceivably run: 


    Measurement.       f           x'          y'        fx'     fy' 
                   Frequency.  Deviation   Deviation 
                               from 7.     from 6.5. 

        5              6           2          1.5        12       9 
        6.5            2           0.5        0           1       0 
        7.5            2           0.5        1           1       2 
        9              3           2          2.5         6       7.5 

                      13          ..          ..         20      18.5 


But in this case the median, the measurement for the seventh 
individual from either end of the series, is 6.5, and according to the 
first calculation the mean deviation referred to 6.5 as origin appears 
to be greater than that referred to 7 as origin. If, however, we 
recalculate, using the more detailed table, we find that the mean 
deviation referred to 6.5 as origin (18.5/13) is really less than the 
mean deviation with reference to 7 as origin, as it should be, for the 
latter now turns out to be 20/13. 
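
The effect of grouping may be seen by running the same calculation on 
both forms of the record. The sketch below is illustrative only: the two 
distributions are those tabulated above, and the helper name 
mean_deviation is introduced for the purpose. 

```python
def mean_deviation(values, freqs, origin):
    """Frequency-weighted mean of the signless deviations from `origin`."""
    n = sum(freqs)
    return sum(f * abs(v - origin) for v, f in zip(values, freqs)) / n


# Grouped record: each individual credited with the mid-value of its group.
grouped_values, grouped_freqs = [5, 7, 9], [6, 4, 3]
# One detailed distribution that is consistent with the grouped record.
detailed_values, detailed_freqs = [5, 6.5, 7.5, 9], [6, 2, 2, 3]

for label, values, freqs in (
    ("grouped", grouped_values, grouped_freqs),
    ("detailed", detailed_values, detailed_freqs),
):
    about_7 = mean_deviation(values, freqs, 7)
    about_6_5 = mean_deviation(values, freqs, 6.5)
    print(label, round(about_7, 3), round(about_6_5, 3))
```

With the grouped record the deviation about 7 appears the smaller; with 
the detailed record the order is reversed, 6.5 being then the true median. 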

Standard Deviation. An alternative method of avoiding the 
signs of the deviations from the average in order to estimate the 
amount of variability of the distribution is to square each separate 
deviation, sum the squares, divide by their number, and take the 
square root of the result. This gives the root-mean-square deviation, 
and it is least when the arithmetic mean of the variables is chosen 
as origin from which to measure the deviations, when it is known 
as the standard deviation. For proof of this minimum principle 
see Part II, p. 266, but it is worth while testing it also with the 
data given in Table (12). 

The numbers in cols. (3) to (6) in Table (13) are obtained simply 
by squaring the corresponding numbers in the same cols. (3) to (6) 
in Table (12). Col. (7) is formed in order to enable us to calculate 
the mean-square deviation referred to 25 as origin; the numbers 
in col. (3) show the squares of the deviations for each individual 
observation, and the numbers in col. (1), by which they are 
multiplied, show how frequently the same values are repeated. Hence 
we get the mean-square deviation with reference to 25 

    = [1(225)+2(100)+3(25)+2(25)+2(100)+1(225)]/14 

    = 975/14 

    = 69.64. 

Thus the root-mean-square deviation referred to 25 

    = √(69.64) 

    = 8.345. 

Similarly, by means of col. (8), formed on exactly the same 
principle, we find that the root-mean-square deviation referred to 
24.64 as origin 

    = √[(214.33+185.86+ . . . )/14] 

    = √(973.22/14) 

    = 8.338. 

But 24.64 is the mean of the distribution, hence 8.338 is the standard 
deviation. 

With the help of cols. (5) and (6) the student may himself 
calculate the root-mean-square deviation with regard to 24 and 26 
respectively as origin; the results should be 8.36 and 8.45. Of 
the four values thus obtained for the root-mean-square deviation, 
the least is that referred to the mean as origin, the standard 
deviation, now proposed as a measure of variability or dispersion 
suitable for most general purposes. 
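
The same test may be put in the form of a short Python sketch. The data 
are again those of the walking example; the function name 
root_mean_square_deviation is an assumption of this illustration and is 
not drawn from the text. 

```python
from math import sqrt

# Miles walked and the number of days on which each distance occurred.
miles = [10, 15, 20, 25, 30, 35, 40]
days = [1, 2, 3, 3, 2, 2, 1]


def root_mean_square_deviation(values, freqs, origin):
    """Square root of the frequency-weighted mean of squared deviations."""
    n = sum(freqs)
    mean_square = sum(f * (v - origin) ** 2 for v, f in zip(values, freqs)) / n
    return sqrt(mean_square)


arithmetic_mean = sum(f * v for v, f in zip(miles, days)) / sum(days)

results = {
    origin: root_mean_square_deviation(miles, days, origin)
    for origin in (25, arithmetic_mean, 24, 26)
}
for origin, value in results.items():
    print(round(origin, 2), round(value, 3))

# The value taken about the arithmetic mean is the standard deviation.
standard_deviation = results[arithmetic_mean]
```

Whatever other origin is tried, the value about the arithmetic mean 
remains the least. 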

This measure possesses several decided advantages over the 
mean deviation ; among others it lends itself more easily to certain 
algebraical processes (e.g. see Part II, p. 158), a fact of importance 
when we wish, for instance, to discuss two sets of observations in 
combination, and it is in general less affected by ‘ fluctuations of 
sampling ’—errors which arise owing to the fact that we cannot as 
a rule survey the whole field of operations, but have to be content 
with a sample. 


Table (13). Number of Miles walked on Successive Days. 

  (1)      (2)       (3)         (4)         (5)        (6)         (7)           (8) 
   f                 x²                                             fx² 
 No. of   Miles   Square of   Square of   Square of  Square of   [No. in       [No. in 
 days.   walked.  Deviation   Deviation   Deviation  Deviation   Col. (1)] ×   Col. (1)] × 
                  from 25.    from 24.64. from 24.   from 26.    [No. in       [No. in 
                                                                 Col. (3)].    Col. (4)]. 

   1       10       225        214.33       196        256         225          214.33 
   2       15       100         92.93        81        121         200          185.86 
   3       20        25         21.53        16         36          75           64.59 
   3       25         0          0.13         1          1           0            0.39 
   2       30        25         28.73        36         16          50           57.46 
   2       35       100        107.33       121         81         200          214.66 
   1       40       225        235.93       256        196         225          235.93 

  14       ..        ..          ..          ..         ..         975          973.22 


Quartile Deviation or Semi-interquartile Range. There is a third 
measure of dispersion, based upon the determination of the quartiles, 
and to introduce them we may refer again to Table (7) in order to 
show how the idea of the median may be extended. 

We define the individual occupying a position one-quarter the 
way along any series of observations, arranged in ascending order 
of magnitude of some organ or character common to all the indi¬ 
viduals of the series, as the lower quartile ; and we define the indi¬ 
vidual occupying a position three-quarters the way along the series 
as the upper quartile. 

When the distribution of observations is divided up into groups 
lying between different limits of the variable under consideration 
the quartiles may, like the median, be calculated by interpolation. 
Thus, in the examination example, the total number of candidates 
is 514 and ¼(514) = 128.5. 

But the 91st candidate from the bottom gets approximately 20 
marks, and the 149th candidate from the bottom gets approximately 
25 marks. Hence the imaginary candidate, No. 128.5, 
should get a number of marks lying somewhere between 20 and 
25. But if, in this neighbourhood, a difference of 

    (149 - 91) candidates corresponds to a difference of 5 marks, 

    (128.5 - 91) candidates should correspond to a difference of 
    5 × 37.5/58 marks. 

Thus, the marks assigned to the lower quartile candidate are 
approximately 

    = 20 + (5 × 37.5)/58 

    = 20 + 3.23. 

Hence the lower quartile = 23.23. 


Again ¾(514) = 385.5. 

But the 318th candidate from the bottom gets approximately 35 
marks, and the 397th candidate from the bottom gets approximately 
40 marks. Therefore, the imaginary candidate, No. 385.5, 
should get approximately a number of marks 

    = 35 + 5 × 67.5/79 

    = 39.27. 

Hence the upper quartile = 39.27. 


It is clear that the quartiles together with the median divide the 
whole series of observations into approximately four equal groups, so 
that the quartile marks give a rough idea of the distribution on either 
side of the average. 

    [Diagram: a line marked at Q = 23.23, Med. = 31.52, and Q' = 39.27.] 

For this reason half the difference between the quartiles provides a 
convenient measure of the dispersion, and it is called the quartile 
deviation or semi-interquartile range; thus, if Q be the lower and 
Q' the upper quartile, we have 

    the quartile deviation = ½(Q' - Q). 

In the above example, this measure 

    = ½(39.27 - 23.23) 

    = ½(16.04) 

    = 8.02. 
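
The interpolation by which the quartiles were found lends itself to a 
brief Python sketch. It assumes, as the text does, that marks increase 
uniformly with rank between the two bracketing candidates; the function 
name interpolate_marks is introduced here and is not part of the text. 

```python
def interpolate_marks(rank, lower, upper):
    """Linear interpolation between two (rank, marks) pairs bracketing `rank`."""
    (rank_lo, marks_lo), (rank_hi, marks_hi) = lower, upper
    share = (rank - rank_lo) / (rank_hi - rank_lo)
    return marks_lo + (marks_hi - marks_lo) * share


candidates = 514

# Bracketing candidates quoted in the text: (rank from the bottom, marks).
lower_quartile = interpolate_marks(candidates / 4, (91, 20), (149, 25))
upper_quartile = interpolate_marks(3 * candidates / 4, (318, 35), (397, 40))

# Half the distance between the quartiles.
quartile_deviation = (upper_quartile - lower_quartile) / 2

print(round(lower_quartile, 2), round(upper_quartile, 2),
      round(quartile_deviation, 2))
```

The same function serves for the median or for any decile, given the 
appropriate rank and its bracketing candidates. 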

If a more minute analysis of the distribution of variables is 
desired, we may range them in order of magnitude as before, and 
divide up the series into ten equal parts, recording every tenth along 
the line; these tenths are called deciles. 

Thus, the deciles in the examination example correspond to the 
marks assigned to imaginary candidates numbered as follows: 
51.4, 102.8, 154.2, 205.6, 257.0, 308.4, 359.8, 411.2, 462.6, 
and they can be calculated by the interpolation method used in 
finding the median and quartiles. 

This way of representing the chief features of a distribution, by 
quartiles, etc., was much used by Galton in his researches and 
writings. 

The student may be perplexed as to which should be used of so 
many different measures of dispersion or variability, but there 
need be no real confusion. If a rough estimate only is wanted the 
quartile deviation is a convenient measure, assuming that the 
variables observed or measured can be ranged in order of magnitude 
so as to admit of the quartiles being readily picked out. Also the 
measure thus obtained is not unsatisfactory when the distribution 
of values of the variable is fairly symmetrical and uniform in its 
gradation from greatest frequency to least. If, however, it is 
conspicuously skew (unsymmetrical) and there are erratic differences 
in frequency between successive values of the variable, it 
is better to choose a measure which gives the magnitude and 
the position of each recorded observation its due weight in the 
deviation sum. 


Then again the choice as between the standard deviation and the 
mean deviation may be sometimes determined by the particular 
kind of average which suits the problem best. But as the arithmetic 
mean is the most important and the most commonly used 
average, so the standard deviation is certainly the most important 
measure of dispersion. 

It will be shown later that the following relations are approximately 
true when the distribution of variables is not very far from 
being symmetrical: 

    (1) Quartile deviation = 2/3 (Standard deviation). 

    (2) Mean deviation = 4/5 (Standard deviation). 

In (2) the mean deviation should be measured from the mean. 

Also (3) a range of two or three times the standard deviation 
on both sides of the mean will be found to include the majority 
of the observations in the distribution. 

Coefficient of Variation. Before we pass on to illustrate the 
subject of averages and variability by means of a few examples 
it is necessary to introduce one more constant known as the 
coefficient of variation. It is a measure of variability but differs 
from the chief measures already discussed in that they are absolute 
measures, whereas the coefficient of variation, written C. of V. for 
short, is a ratio or relative measure. The need for it arises when 
we reflect that in order to gauge fairly the amount of variability we 
ought to have in mind also the size of the mean from which the 
variation is measured; just as a difference of 1 foot between the 
heights of two men is a conspicuous difference when the normal 
height is between 5 and 6 feet, whereas the same difference of 1 foot 
between two measured miles would be trifling because the standard 
mile contains over 5000 feet. 

The coefficient of variation has been defined by Karl Pearson 
(Phil. Trans., vol. 187 A, p. 277), who first suggested its use, as ‘the 
percentage variation in the mean, the standard deviation (S.D.) 
being treated as the total variation in the mean,’ so that 

C. of V. = 100 S.D./Mean. 

He pointed out that it would be idle, in dealing with the variation 
of men and women (or indeed very often of the two sexes of any 
animal), to compare the absolute variation of the larger male organ 
directly with that of the smaller female organ, because several of 
these organs, as well as the height, the weight, brain capacity, etc., 
are greater in man than in woman in the approximate proportion 
of 13 : 12. 

As an example of the use of the C. of V., figures may be quoted 
from a paper by R. Pearl and F. J. Dunbar (Biometrika, vol. ii. 
pp. 321 et seq.), On Variation and Correlation in Arcella. 
Measurements in mikrons were made of the outer and inner diameters of 
604 specimens of a shelled rhizopod belonging to the group 
Imperforata, family Arcellina, with the following results, to two decimal 
places: 


                        Mean.      S.D.      C. of V. 

  Outer diameter  .     55.79      5.73      10.27 per cent. 
  Inner    „      .     15.91      2.17      13.68    „ 


Thus, judging by the S.D. column, giving the absolute size of 
deviation, the outer diameter would appear to be more variable 
than the inner, but the C. of V. column shows that, if we take the 
sizes of the two diameters into account, the inner is really the 
more variable of the two. To turn aside the edge of possible criti¬ 
cism it should be added that the authors also give the errors to 
which the above measures are subject, as unless these are known 
we cannot tell whether the differences observed in variation are 
significant or not of a real difference in fact, but that question 
must be left until the theory of errors due to sampling has been 
developed in a later chapter. 
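
The coefficient is easily computed. In the Python sketch below the means 
are taken as 55.79 and 15.91 and the standard deviations as 5.73 and 2.17 
mikrons, which is how the table is read here; since these figures are 
themselves rounded to two decimal places, a coefficient recomputed from 
them need not agree with the printed one in the last decimal place. The 
function name is an assumption of this illustration. 

```python
def coefficient_of_variation(mean, standard_deviation):
    """Pearson's C. of V.: the S.D. expressed as a percentage of the mean."""
    return 100 * standard_deviation / mean


# Figures as read from the table (assumed readings, rounded to two places).
outer = coefficient_of_variation(55.79, 5.73)
inner = coefficient_of_variation(15.91, 2.17)

print(round(outer, 2), round(inner, 2))

# The inner diameter has the smaller S.D. but the larger relative variation.
print(inner > outer)
```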

The C. of V. varies considerably for different characters. W. R. 
Macdonell states that ‘3 to 6.5 are representative values for 
variability in man, while in plants it may run to 40,’ and Pearson and 
others have shown that for stature in man it varies from about 3 to 4 
and for the length of long bones from 4 to 5. 



FREQUENCY DISTRIBUTION: EXAMPLES TO ILLUSTRATE 
CALCULATION AND PLOTTING: SKEWNESS 

Calculation of Mean and Standard Deviation. Example. We 
return now to the examination example in order to show how the 
labour of calculation in finding the arithmetic mean and standard 
deviation of a frequency distribution may be somewhat lessened. 

The various steps in the process appear in Table (14). In the 
first column the marks at the middle of each class-interval have 
been written down, and we make the assumption that all the 
candidates in any one class have the same number of marks, namely, the 
marks at the middle of the class-interval. In any case where the 
number of observations is large, and where the class-intervals are 
reasonably small, the errors resulting from such an assumption will 
be insignificant, because the individuals in each class are just as 
likely to have values above as below the value at the middle of the 
class-interval, and they will therefore compensate for one another. 

We now seek to alter the scale of marking so as to produce a 
simpler set of marks than the original, which will make the work 
of finding the mean also simpler, but we must not forget at the 
end to change back again to the original scale. We choose a number 
from col. (1), somewhere near the required mean, to act as a kind 
of origin from which to measure the other numbers in the column. 
This choice is only a rough guess, and it is really immaterial which 
number is selected as origin, except that the nearer it is to the 
mean the lighter will be the calculation to follow; the number 33 
has been selected in this instance. 

In col. (2) are written down the deviations of the marks in each 
class from 33, so that now some candidates appear as if they were 
5, 10, 15 . . . marks to the bad, and others as if they were 5, 10, 
15 . . . to the good. So long as we remember to add 33 at the 
end we can content ourselves therefore by finding the mean of the 
marks as given in col. (2). But these again can be further simplified 
by dividing each candidate’s marks by 5, and we then only need 
to find the mean of the marks as shown in col. (3), so long as we 
remember to multiply by 5 at the first step back to the old scale 
of marking. The addition of col. (5) makes it easy to calculate 
this mean, for it gives the result of multiplying each value of the 
variable (the number of marks in each class) by its appropriate 
weight (the number of candidates who obtained that number of 
marks). 


Table (14). Marks obtained by 514 Candidates in a certain 
Examination (Analysis of Method for Calculating 
Mean and Standard Deviation). 

      (1)             (2)            (3)          (4)           (5)            (6) 
  Marks on old    Deviation of    Marks on     Frequency     Product of     Product of 
     scale.       Nos. in Col.    new scale.      of         Nos. in        Nos. in 
                  (1) from 33.                 Candidates.   Cols. (3)      Cols. (3) 
                                                             & (4).         & (5). 
                                     (x)          (f)          (fx)           (fx²) 

   3 = 33 - 30       - 30           - 6            5          - 30            180 
   8 = 33 - 25       - 25           - 5            9          - 45            225 
  13 = 33 - 20       - 20           - 4           28          - 112           448 
  18 = 33 - 15       - 15           - 3           49          - 147           441 
  23 = 33 - 10       - 10           - 2           58          - 116           232 
  28 = 33 -  5       -  5           - 1           82          -  82            82 
  33 = 33               0             0           87             0              0 
  38 = 33 +  5       +  5           + 1           79          +  79            79 
  43 = 33 + 10       + 10           + 2           50          + 100           200 
  48 = 33 + 15       + 15           + 3           37          + 111           333 
  53 = 33 + 20       + 20           + 4           21          +  84           336 
  58 = 33 + 25       + 25           + 5            6          +  30           150 
  63 = 33 + 30       + 30           + 6            3          +  18           108 

       ..             ..             ..          514          - 110          2814 

Thus, on this new scale, the mean marks obtained are 

    = [5(-6)+9(-5)+28(-4)+ . . . +87(0)+ . . . +6(+5)+3(+6)]/514 

    = (-532+422)/514 

    = -110/514 

    = -0.214. 

This, then, is the mean of the marks obtained by the candidates on 
the scale indicated in col. (3). If the marks are on the scale given 
in col. (2), the mean is 5(-0.214), i.e. -1.070. To bring them back 
to the original scale as in col. (1) we must add 33 to this result, so 
that the required arithmetic mean 

    = 33 + 5(-0.214) 

    = 33 - 1.070 

    = 31.93. 
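
The change of scale may be followed step by step in Python. This is a 
sketch under the assumptions of Table (14), namely mid-class marks of 
3, 8, . . . 63, an arbitrary origin of 33 and a unit of 5 marks; the 
variable names are introduced here for illustration. 

```python
# Mid-class marks on the old scale and the frequency of candidates in each.
marks = list(range(3, 64, 5))  # 3, 8, 13, ..., 63
frequency = [5, 9, 28, 49, 58, 82, 87, 79, 50, 37, 21, 6, 3]

origin, unit = 33, 5

# Marks on the simplified scale: -6, -5, ..., 0, ..., +6.
coded = [(m - origin) // unit for m in marks]

candidates = sum(frequency)
sum_fx = sum(f * x for f, x in zip(frequency, coded))

# Mean on the new scale, then carried back to the old scale.
coded_mean = sum_fx / candidates
mean_marks = origin + unit * coded_mean

# Direct calculation on the old scale, for comparison.
direct_mean = sum(f * m for f, m in zip(frequency, marks)) / candidates

print(candidates, sum_fx, round(coded_mean, 3), round(mean_marks, 2))
```

The shortened and the direct calculation agree, the change of origin and 
unit affecting only the labour involved. 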


To find the Standard Deviation, or the root-mean-square deviation 
from the arithmetic mean, it is convenient as before to work with 
the simplified scale, to measure the deviations from the arbitrary 
origin (33) associated with that scale, and to make the necessary 
corrections at the end of the work. 

Col. (5) in Table (14) gives the deviation multiplied by the 
frequency in each class, the frequency denoting the number of 
times the particular deviation occurs. Hence, if these numbers be 
multiplied again by the numbers in col. (3), we shall have each 
separate deviation squared and multiplied by its frequency. The 
results are shown in col. (6), and they must be added, and their 
sum divided by the sum of the frequencies (514), to give the 
mean-square deviation, which we may represent by s². 

    Thus s² = 2814/514 

            = 5.475, 

and this is the mean-square deviation referred to 33 as origin. 
We require the corresponding expression referred to the mean, 
31.93, as origin. If we denote this by s₀², there is a simple relation 
connecting the two, namely, 

    s₀² = s² - x̄², 

where x̄ is the deviation of the mean itself from 33 [see Appendix, 
Note 6]; of course s₀, s, and x̄ are all to be measured on the same 
scale, the simplified scale adopted with 5 marks as unit. 

Now we have already shown that the deviation of the mean from 
33 = -0.214, and this is therefore the value of x̄. 

    Hence s₀² = 5.475 - (-0.214)² 

              = 5.475 - 0.046 

              = 5.429 

              = (2.33)². 

And, returning to the old scale, the standard deviation, usually 
denoted by σ, 

    = 5(2.33) 

    = 11.65. 

We notice that 3σ = 34.95, and this range on either side of the 
mean amply takes in all the observations. 
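
The correction for the arbitrary origin can be checked against a direct 
calculation about the mean. The Python sketch below assumes the 
frequencies and the simplified scale of Table (14); the variable names 
are illustrative only. 

```python
from math import sqrt

# Marks on the simplified scale (origin 33, unit 5) and class frequencies.
coded = list(range(-6, 7))
frequency = [5, 9, 28, 49, 58, 82, 87, 79, 50, 37, 21, 6, 3]
unit = 5

candidates = sum(frequency)
sum_fx = sum(f * x for f, x in zip(frequency, coded))
sum_fx2 = sum(f * x * x for f, x in zip(frequency, coded))

# Mean-square deviation about the arbitrary origin, on the new scale.
mean_square_about_origin = sum_fx2 / candidates
# Deviation of the mean from the arbitrary origin, on the new scale.
coded_mean = sum_fx / candidates

# Mean-square deviation about the mean = that about the origin,
# less the square of the deviation of the mean from the origin.
mean_square_about_mean = mean_square_about_origin - coded_mean ** 2
standard_deviation = unit * sqrt(mean_square_about_mean)

# Direct calculation about the mean, for comparison.
direct = unit * sqrt(
    sum(f * (x - coded_mean) ** 2 for f, x in zip(frequency, coded)) / candidates
)

print(sum_fx2, round(mean_square_about_origin, 3), round(standard_deviation, 2))
```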

The mean deviation is readily found from Table (14) by adding up 
the numbers in col. (5) regardless of sign and dividing by the sum 
of frequencies, 514. 

Thus, on the new scale, the mean deviation 

    = 954/514 

    = 1.856, 

which, on the old scale, becomes 5(1.856) or 9.28. This, however, 
is the mean deviation measured from 33 as origin, and a correction 
has to be applied to get the mean deviation measured from the 
median or from the mean. 

To get the mean deviation from the mean we note that the 
difference between the mean, 31.93, and 33 is 1.07. Hence it 
should be clear from Table (14) that, by measuring from 33 instead 
of from 31.93, we have made the deviations of all the marks from 
33 upwards too little by 1.07, and we have made the deviations of 
all the marks from 28 downwards too much by 1.07. Hence, to 
get the deviation required we must add to 9.28 an amount 

    = (1/514)[1.07(87+79+ . . . +3) - 1.07(82+58+ . . . +5)] 

    = (1.07/514)(283 - 231) 

    = (1.07/514) × 52 

    = 0.108. 

Therefore, the mean deviation measured from the mean = 9.39. 
This may be compared with 4/5 (standard deviation) = 9.32. 

Also the quartile deviation for this distribution has been shown 
to be = 8.02, and it may be compared with 2/3 (standard deviation) 
= 7.77. 
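
These comparisons may be reproduced as follows. The Python sketch 
assumes the mid-class marks and frequencies of Table (14), and takes the 
two rough relations to be four-fifths and two-thirds of the standard 
deviation respectively; the function and variable names are introduced 
for illustration. 

```python
from math import sqrt

# Mid-class marks on the old scale and the frequency of candidates in each.
marks = list(range(3, 64, 5))
frequency = [5, 9, 28, 49, 58, 82, 87, 79, 50, 37, 21, 6, 3]
candidates = sum(frequency)


def mean_deviation(origin):
    """Frequency-weighted mean of the signless deviations from `origin`."""
    return sum(f * abs(m - origin) for f, m in zip(frequency, marks)) / candidates


mean_marks = sum(f * m for f, m in zip(frequency, marks)) / candidates
standard_deviation = sqrt(
    sum(f * (m - mean_marks) ** 2 for f, m in zip(frequency, marks)) / candidates
)

from_origin = mean_deviation(33)         # measured from the arbitrary origin
from_mean = mean_deviation(mean_marks)   # measured from the mean itself

print(round(from_origin, 2), round(from_mean, 2))
# Rough relations for a nearly symmetrical distribution.
print(round(4 / 5 * standard_deviation, 2), round(2 / 3 * standard_deviation, 2))
```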


Plotting of a Frequency Distribution. The data for the two 
examples which follow are taken from the Quarterly Return of 
Marriages, Births, and Deaths, No. 201, issued by the 
Registrar-General. 

The first shows the proportion to population of cases of infectious 
disease notified in 241 large towns of England and Wales for the 
thirteen weeks ended 4th April 1914. This proportion was given 
for each town separately in the Return, but, in order to bring out 
the distinctive features of the distribution, the several towns have 
been, in Table (15), represented by dots and put into different classes 
according to the proportion of infectious cases notified in each, 
with a separate line for each class: e.g. if the proportion for any 
town was 5.37 a dot was placed in the line corresponding to the 
class of towns for which the rate was ‘4 and less than 6.’ Every 
fifth dot in each line was ticked off, so as to make them easy to 
count up and also to keep the lines, down the paper as well as 
across, straight. The frequency, i.e. the number of dots in each 
class, was then recorded in a column at the extreme right-hand 
side of the paper. 


Table (15). Proportion to Population of Cases of Infectious 
Disease notified in 241 Large Towns of England and 
Wales during the Thirteen Weeks ended 4th April 1914. 

 Case rate   Each dot below represents One Town with Notified Rate of Infectious     Total No. 
 per 1000    Disease between limits as given in previous column.                     of Towns 
 persons                                                                             with given 
 living.                                                                             Rate. 

   0-        •••••                                                                        5 
   2-        ••••• ••••• ••••• ••••• ••••• ••••• ••••• ••••                               39 
   4-        ••••• ••••• ••••• ••••• ••••• ••••• ••••• ••••• ••••• ••••• ••••• ••••• ••••• ••••   69 
   6-        ••••• ••••• ••••• ••••• ••••• ••••• ••••• ••••• •                            41 
   8-        ••••• ••••• ••••• ••••• ••••• ••••                                           29 
  10-        ••••• ••••• ••••• ••••• ••                                                   22 
  12-        ••••• ••••• ••••• •                                                          16 
  14-        ••••• ••                                                                      7 
  16-        •••••                                                                         5 
  18-        •••                                                                           3 
  20-        ••••                                                                          4 
  22-                                                                                      0 
  24-                                                                                      0 
  26-        •                                                                             1 

                                                                                         241 



    [Fig. (1). Rate of Disease per 1000 persons living.] 


It will be at once seen that this procedure, without calculating 
any averages, etc., ultimately gives to the eye a very good picture 
of the distribution, and indeed it is the basis of the graphical method 
of studying statistics. In drawing a proper graph we use a specially 
ruled sheet of paper which is divided up into a large number of 
equal small squares by ‘horizontal’ (cross) and ‘vertical’ 
(up-and-down) lines. This merely enables us to place our dots accurately 
in position, as shown in fig. (1), where the numbers 0, 5, 10 . . . 
have been marked off along the line Ox to correspond to ‘case 
rates’ of these magnitudes: thus rates of ‘4 and less than 6’ 
were recorded by 69 successive dots along a vertical line at a 
distance 5 (the centre of the class-interval 4-6) from the axis Oy. 

    [Fig. (2).] 

The final configuration in fig. (1), when turned half round, is 
exactly the same as that of Table (15). If desired the frequency 
may be recorded, dot by dot, on a side piece of paper and then 
only the topmost dot in each class need be marked on the graph 
sheet. In order, however, to enable the eye to measure the height 
of each frequency in relation to the rest, it is advisable in that 
case to connect up adjacent dots as in fig. (2) or as in fig. (3). 

    [Fig. (3). Rate of Disease per 1000 persons living.] 

The last method of representation (fig. (3)), to which the name 
histogram has been given by Professor Karl Pearson, is particularly 
useful and should be carefully studied. It is formed in this case by 
erecting a succession of rectangles with the lines 02, 24, 46 . . . 
along Ox as their bases, corresponding to the successive classes of 
the given distribution, and with heights proportional to the 
frequencies proper to those classes. It is not necessary to complete 
the sides of the rectangles, but, if they were completed, each would 
enclose a number of squares proportional to the frequency of towns 
with the rate of disease defined by its base: e.g. the first rectangle 
would enclose 10 squares, the second 78, the third 138, and so on, 
numbers respectively proportional to 5, 39, 69, and so on. It 
follows that the total area enclosed between the histogram and the 
axis Ox is proportional to the aggregate frequency of towns observed. 
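
The proportionality of area to frequency may be illustrated by a rough 
Python sketch, which prints a crude histogram in text and counts the 
squares enclosed by each rectangle. The class limits (0, 2, 4 . . . 26) 
and the frequencies are as read here from the table of towns, and two 
small squares are allowed to each town, in keeping with the figures 10, 
78, 138 quoted above; the function name rectangle_areas is introduced for 
illustration. 

```python
# Lower class limits (case rate per 1000) and number of towns in each class.
lower_limits = list(range(0, 28, 2))
towns = [5, 39, 69, 41, 29, 22, 16, 7, 5, 3, 4, 0, 0, 1]

class_width = 2
squares_per_town = 2  # each town is represented by two small squares


def rectangle_areas(frequencies, squares_each):
    """Squares enclosed by each rectangle of the histogram."""
    return [squares_each * f for f in frequencies]


areas = rectangle_areas(towns, squares_per_town)

for low, f, area in zip(lower_limits, towns, areas):
    # One '#' per town gives a sideways picture of the histogram.
    print(f"{low:2d}-{low + class_width:<2d} {'#' * f} ({f} towns, {area} squares)")

print(sum(towns), sum(areas))
```

The total of the squares is the same fixed multiple of the total number 
of towns as each rectangle is of its own class frequency. 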

Now we might conceive a step further taken and a smoothed 
curve drawn freehand so as to agree as closely as possible with 
fig. (2) or fig. (3), but with all the sharp corners smoothed out, and 
so nicely adjusted as to make the area enclosed between the curve, 
the axis Ox, and lines parallel to Oy defining the limits of any class, 
proportional to the frequency of towns in that class. To this 
fig. (2) and fig. (3) might be regarded as approximating if only a 
sufficient number of observations were recorded, and only in that 
case would it be possible to draw it with any accuracy. Such a 
curve is called a frequency curve, measuring as it does the frequency 
of the observations in different classes. 


[Assuming that corresponding to a given frequency distribution a curve
of this kind does really exist (and the assumption turns upon the frequency
being continuous), the reader who is acquainted with the notation of the
Calculus will recognise that, if (x, y) represents any point on the curve, y δx
measures the frequency of observations or measurements of an organ or
character lying between the values x and (x + δx), when the total frequency
comprises a large number of observations, say 600 to 1000.

Further, it will appear later that the mean, the median, and the mode
have a geometrical interpretation of no small importance associated with the
curve.

The mean x corresponds to the particular ordinate y which passes through
the centroid or centre of gravity of the area between the frequency curve
and axis Ox, because

the mean = Σ(x · y δx) / Σ(y δx),

where the summation extends throughout the distribution,

= ∫ x y dx / ∫ y dx,

where the integration extends throughout the curve.

The median x corresponds to the ordinate y which bisects this same area;
e.g. in fig. (3), the number of small squares on either side of the median in the
space bounded by the histogram and the axis represents half the total number
of observations, two small squares corresponding to each observation.

The mode x corresponds to the maximum ordinate of the curve, measuring
the greatest frequency in the whole distribution.]

Skewness. There is one feature of a frequency distribution which
catches the eye sooner almost than any other, and that is its
symmetry or lack of symmetry. It is important therefore that we
should have some means of measuring it.

In a symmetrical distribution the mean, mode, and median
coincide, and we have, as it were, a perfect balance between the
frequency of observations on either side of the mode or ordinate of
maximum frequency. In a skew distribution the centre of gravity
is displaced and the balance thrown to one side: the amount of this
displacement measures the skewness. But there is another factor
to be taken into account, for when the variability of the distribution
is great the balance is more sensitive than when it is small,
and the difference between mean and mode is consequently more
pronounced though it may not be significant of any greater skewness.
This will be clear in the light of the analogy of the swing
of a pendulum. If OPP' denote the pendulum in the accompanying
figure, OAA' its mean position, and OBB' an extreme position, the
displacement in the position OPP' from the mean, if measured
along the scale AB, is AP, and, if measured along the scale A'B',
is A'P'. But, since the amount of swing in either case is the same,
it would be more appropriate to write the linear displacement as a
fraction of the full swing so as to make these two measures also
the same, thus

AP/AB = A'P'/A'B'.

So, in the case of a frequency distribution, Professor Karl Pearson
has suggested as a suitable measure for skewness,
not the difference between mean and mode, but the ratio of this
difference to the variability. Thus

skewness = (mean − mode)/S.D.,

or, approximately,

= 3(mean − median)/S.D. (see p. 39),

a form which is sometimes useful.

[Figure: two frequency curves, each marked "x increasing" along the horizontal axis.]

According to this convention the skewness is regarded as positive
when the mean is greater than the mode, and as negative when
the mode is greater than the mean.
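The two forms of the measure may be set down as a short Python sketch. The function names are our own and purely illustrative; the figures fed in are the rounded values quoted for the infectious diseases example (mean 7.56, mode about 5.50, median 6.39, S.D. 4.37), so the output is only as good as those estimates.

```python
def pearson_skewness(mean, mode, sd):
    """Pearson's measure of skewness: (mean - mode) / S.D."""
    return (mean - mode) / sd


def pearson_skewness_from_median(mean, median, sd):
    """Approximate form: 3 (mean - median) / S.D."""
    return 3 * (mean - median) / sd


# Rounded values from the infectious diseases example (the mode is a rough estimate).
print(round(pearson_skewness(7.56, 5.50, 4.37), 2))
print(round(pearson_skewness_from_median(7.56, 6.39, 4.37), 2))
```

A positive result indicates that the mean lies above the mode; the two forms agree only in so far as the rule (mean − mode) = 3(mean − median) holds for the distribution in hand.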

Illustrations of frequency curves, with the position of mode and 
mean marked, will be found in Chapter XVII.

We proceed to the detailed calculations necessary in the infectious 
diseases example. 


Table (16). Proportion to Population of Cases of Infectious
Disease notified in 241 Large Towns of England and
Wales during the Thirteen Weeks ended 4th April 1914.

        (1)               (2)         (3)            (4)               (5)
  Case Rate per        Deviation  Frequency of   Product of        Product of
  1000 persons living.  from 7.   Towns with     Nos. in           Nos. in
                                  given Rate.    Cols. (2) & (3).  Cols. (2) & (4).
                          (x)        (f)           (fx)              (fx²)

   0 and less than  2     − 3          5           − 15                45
   2   ,,    ,,     4     − 2         39           − 78               156
   4   ,,    ,,     6     − 1         69           − 69                69
   6   ,,    ,,     8       0         41              0                 0
   8   ,,    ,,    10     + 1         29           + 29                29
  10   ,,    ,,    12     + 2         22           + 44                88
  12   ,,    ,,    14     + 3         16           + 48               144
  14   ,,    ,,    16     + 4          7           + 28               112
  16   ,,    ,,    18     + 5          5           + 25               125
  18   ,,    ,,    20     + 6          3           + 18               108
  20   ,,    ,,    22     + 7          4           + 28               196
  26   ,,    ,,    28     +10          1           + 10               100

  Total                   . .        241           + 68              1172

Example (2). The various averages and measures of variability
of the distribution can be calculated just as in the case of the last
example, and the data required to determine the mean and the
standard deviation are set out in Table (16). We can afford now
to miss out some of the more obvious steps in explanation.

On the scale of col. (2), where a difference of 2 in the case rate,
per 1000 persons living, is the unit and where a case rate of 7 is
taken as origin, the mean, by the result of col. (4),

= 68/241
= 0.282.

Hence, on the original scale, the mean

= 7 + 2(0.282)
= 7.564.

Again, the mean-square deviation, on the scale of col. (2), measured
from 7 as origin is

s'² = 1172/241
= 4.863;

and x̄, the deviation of the mean from 7 as origin, on the scale of
col. (2) = 0.282. Thus the mean-square deviation measured from
the mean,

= 4.863 − (0.282)²
= 4.783.

Therefore, the standard deviation σ, on the original scale

= 2√4.783
= 4.374.

Since 3σ = 13.122, the range '(mean − 3σ) to (mean + 3σ)' includes
all but one or two observations.

To determine the median, we conceive the towns ranged in order
according to the proportion of infectious cases notified in each,
from the least to the greatest, and the town with the median rate
is the 121st from either end.

But the 113th town has a notified case rate of approximately 6
per 1000, and the 154th town has a notified case rate of approximately
8 per 1000.

Thus a difference of 41 towns corresponds to a difference of 2 in
the rate, hence a difference of 8 towns corresponds to a difference
of 0.39 in the rate; therefore the median rate = 6.39 approximately.
By referring to the original records and writing down the rate
for each town in the group 'rate 6 and less than 8', in which the
median lay, the accurate value of the median turned out to be 6.30.

The lower quartile or case rate of the imaginary town, No. ¼(241),
or 60.25, one-quarter way along the ordered sequence of towns, is
readily shown to be 4.47, and the upper quartile or case rate of
town No. ¾(241), or 180.75, is 9.84.

Hence the quartile deviation

= ½(9.84 − 4.47)
= 2.69.

With this may be compared ⅔(S.D.) = ⅔(4.37) = 2.92.

Again, the mean deviation measured from 7

= 3.253.

Measured from the mean, it becomes

= 3.253 + (0.564/241)[(41 + 69 + 39 + 5) − (29 + 22 + 16 + 7 + 5 + 3 + 4 + 1)]
= 3.253 + (0.564)(67)/241
= 3.41

and this may be compared with ⅘(S.D.) = ⅘(4.374) = 3.50.

If we estimate the mode by inspection of the frequency graphs in
figs. (2) and (3), we should say it comes between 5 and 6; supposing
we call it 5.5, very roughly.

In this case, taking the values actually calculated for mean and
median,

(mean − mode) = 7.56 − 5.50
= 2.06,

and 3(mean − median) = 3(7.56 − 6.39)
= 3(1.17)
= 3.51:

so that the rule

(mean − mode) = 3(mean − median)

is far from being true according to these results; this is partly due,
of course, to the very unsymmetrical character of the distribution.

The relative positions of the mean, median, and modal points
as calculated are indicated in figs. (2) and (3) by three lines drawn
parallel to Oy through these points to meet the graph.

Finally, (mean − mode)/S.D. = 2.06/4.37 = 0.47.
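The arithmetic of cols. (2) to (5) of Table (16) can be checked with a few lines of Python. What follows is a minimal sketch under our own naming (`grouped_mean_sd`, `ORIGIN`, `UNIT` are illustrative labels, not the book's); it assumes, as the worked example does, that each class is represented by its mid-value and that deviations are taken from 7 in units of 2.

```python
import math

# Table (16): lower limit of each class (every class is 2 wide) and its frequency.
lower_limits = [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 26]
frequencies = [5, 39, 69, 41, 29, 22, 16, 7, 5, 3, 4, 1]

ORIGIN = 7.0  # mid-value of the class '6 and less than 8'
UNIT = 2.0    # class width, used as the unit of deviation


def grouped_mean_sd(lower_limits, frequencies, origin, unit):
    """Mean and S.D. of a grouped distribution by the short method."""
    n = sum(frequencies)
    # Deviation of each class mid-value from the origin, in class units (col. 2).
    devs = [((lo + unit / 2) - origin) / unit for lo in lower_limits]
    sum_fx = sum(f * d for f, d in zip(frequencies, devs))       # total of col. 4
    sum_fx2 = sum(f * d * d for f, d in zip(frequencies, devs))  # total of col. 5
    mean_dev = sum_fx / n
    mean = origin + unit * mean_dev
    # Mean-square deviation about the origin, corrected to the mean.
    sd = unit * math.sqrt(sum_fx2 / n - mean_dev ** 2)
    return n, sum_fx, sum_fx2, mean, sd


n, sum_fx, sum_fx2, mean, sd = grouped_mean_sd(lower_limits, frequencies, ORIGIN, UNIT)
print(n, sum_fx, sum_fx2)
print(round(mean, 3), round(sd, 3))
```

The column totals and the resulting mean and standard deviation should agree with the values 241, +68, 1172, 7.564 and 4.374 obtained above.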

Example (3). The next example deals with the deaths of infants
under one year, out of every thousand born, in 100 great towns in
the United Kingdom during the thirteen weeks ended 4th April 1914.
The details of the calculation may be left in this case to the reader,
who is recommended to follow the method shown in the last example
so far as possible throughout, including the plotting of the distribution
in different ways. The statistics are as follows:


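Table (17). Death Rate of Infants under 1 Year per 1000 Births.

        (1)              (2)                    (3)              (4)
    Death Rate.     No. of Towns            Death Rate.     No. of Towns
                    with Death Rate                         with Death Rate
                    as in Col. (1).                         as in Col. (3).

   30 and under  40     . .               120 and under 130     . .
   50   ,,       60     . .               130   ,,      140     . .
   60   ,,       70     . .               140   ,,      150     . .
   70   ,,       80     . .               150   ,,      160     . .
   80   ,,       90     . .               160   ,,      170     . .
   90   ,,      100     . .               170   ,,      180     . .
  100   ,,      110     . .               200   ,,      210     . .
  110   ,,      120     . .               240   ,,      250     . .

[The numbers of towns in cols. (2) and (4) are missing from this copy.]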

The more important results are:

Arithmetic mean = 118.9; S.D. = 32.2;
median = 120.9; quartile deviation = 19.5.

Example (4). As another example corresponding details may be
worked out for the following temperature records taken at noon
at a certain spot in Chester week by week during a period of time
covering five years, the results in this case being

mean = 55.10; S.D. = 10.33;
median = 54.88; quartile deviation = 7.94.

Table (18). 257 Weekly Records of Temperature (Fahrenheit).

      (1)               (2)                   (3)               (4)
  Temperature     No. of Records         Temperature     No. of Records
  Limits in       between Limits         Limits in       between Limits
  Degrees.        shown in Col. (1).     Degrees.        shown in Col. (3).

  25.5-29.5             1                53.5-57.5            30.5
  29.5-33.5             1                57.5-61.5            31.5
  33.5-37.5             9                61.5-65.5            30
  37.5-41.5            11.5              65.5-69.5            26
  41.5-45.5            28                69.5-73.5            13.5
  45.5-49.5            31.5              73.5-77.5             4
  49.5-53.5            36.5              77.5-81.5             3
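The median and quartile deviation quoted for Example (4) can be recovered from Table (18) by the same interpolation used for the towns in Example (2). The Python sketch below is illustrative only: the helper `value_at_rank` is our own name, the frequencies are those read from the table, and the ranks follow the conventions of Example (2), namely (n + 1)/2 for the median and n/4, 3n/4 for the quartiles.

```python
# Table (18): lower limit of each class (width 4 degrees) and number of records.
lower_limits = [25.5, 29.5, 33.5, 37.5, 41.5, 45.5, 49.5,
                53.5, 57.5, 61.5, 65.5, 69.5, 73.5, 77.5]
frequencies = [1, 1, 9, 11.5, 28, 31.5, 36.5,
               30.5, 31.5, 30, 26, 13.5, 4, 3]
WIDTH = 4.0


def value_at_rank(rank, lower_limits, frequencies, width):
    """Value of the variable at a given rank, interpolating within its class."""
    cumulative = 0.0
    for lo, f in zip(lower_limits, frequencies):
        if f > 0 and rank <= cumulative + f:
            # Assume the records are spread evenly through the class.
            return lo + width * (rank - cumulative) / f
        cumulative += f
    raise ValueError("rank exceeds the total frequency")


n = sum(frequencies)
median = value_at_rank((n + 1) / 2, lower_limits, frequencies, WIDTH)
lower_quartile = value_at_rank(n / 4, lower_limits, frequencies, WIDTH)
upper_quartile = value_at_rank(3 * n / 4, lower_limits, frequencies, WIDTH)
quartile_deviation = (upper_quartile - lower_quartile) / 2

print(n, round(median, 2), round(quartile_deviation, 2))
```

The assumption of an even spread within each class is the same one that underlies the statement that 'a difference of 41 towns corresponds to a difference of 2 in the rate'.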
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Before closing the chapter a slightly different manner of graphing 
the statistics is worth noticing, as it provides us with a fairly quick 
though rough alternative method of determining the mode and 
median. 

Take, for example, the examination marks data which for this
purpose must first be thrown into the second form shown below
Table (7). We mark off on some convenient scale along OX distances
5, 10, 15, 20 . . . 65 from O to represent these numbers
of marks respectively, and at the points obtained we erect lines
parallel to OY of lengths 5, 14, 42, 91 . . . 514 to represent the
numbers of candidates who obtained not more than 5, 10, 15, 20
. . . 65 marks respectively. A freehand curve is then drawn
through the summits of these lines in the manner indicated in
fig. (4), starting from a height 5 and rising to a height 514 above
the axis OX. It is called an ogive curve.

By means of this curve we can approximately state at once how
many candidates obtained any given number of marks or less.
Suppose, for example, we wish to know how many candidates
obtained 22 marks or less, we have only to measure off a distance
22 from O, represented by ON, and erect a perpendicular NP to
meet the curve at P. Since NP = 110 we infer from the manner in
which the curve has been formed that 110 candidates obtained
22 marks or less, so that, incidentally, the 110th candidate from
the bottom must have obtained approximately 22 marks. This
suggests that by working backwards we can also read off roughly
the number of marks gained by any particular candidate when his
order in the list is known. Thus, to find the median, i.e. the marks
due to candidate No. 257.5, we merely draw a line parallel to OX
at a height 257.5 above it and the portion of this line cut off between
the curve and OY measures the median. The value given by this
method is approximately 31.6. Similarly the quartiles are found
by drawing lines parallel to OX at heights 128.5 and 385.5 above
it with results about 23.3 and 39.2 respectively.

Again, as we gradually increase the number of marks, the number
of candidates getting that number of marks or less must increase
also, but the rate of this second increase is variable. The reader
will perceive that where the height above OX changes slowly the
gradient of the curve is small, but where it changes by big steps
the gradient is steep, and it is at its steepest just in the neighbourhood
where the greatest addition is being made to the height as
the marks increase, i.e. where the frequency of additional candidates
is at its greatest, so determining the mode: this should be
clear on a comparison of the two arrangements of the data in and
below Table (7). By sliding a straight-edge along the contour of
the curve we can estimate approximately where the curve is
steepest, for at this point the direction of turning of the ruler or
straight-edge must change. This gives for the mode a value in the
neighbourhood of 32.

[Fig. (4). Ogive curve; the surviving part of the caption reads ". . . more than any given Number of Marks."]

It might be advisable to treat the other examples by this method
also, so as to compare results.
From the mathematical point of view graphs may be regarded as
the alphabet of Algebraical Geometry.

We can locate a point in a plane, relative to two perpendicular
lines or axes as they are called, OX, OY, which serve as boundaries
of measurement, when we know y and x, its shortest distances from
these boundaries. This fact serves to connect up Geometry, in which
points are elements, with Algebra, in which x's and y's are elements.

[Figure: a point P with co-ordinates x and y, referred to the axes OX and OY through the origin O.]

The names abscissa (ab, from, . . . refer to them together, they may
be spoken of as the co-ordinates of P.
The celebrated French philosopher, Descartes (1596-1650), was
the founder of Cartesian Geometry, and if we may venture to compress
the essence of his system into a single statement, it is this:
When a point P is free to take up any position in a given plane,
its x and y are quite independent: they may be allotted any values
irrespective of one another. Suppose, however, that P is constrained
to lie somewhere on an assigned curve, such as APB in the figure,
then x and y are no longer independent, for, so soon as x is fixed,
y is fixed also; it follows that in this case some relation, algebraical
or otherwise, such as y = x² − 2x + 7,
must exist between x and y, and the relation may be called the
equation of the curve which gives rise to it.

Now, if to every curve there corresponds in this way some
equation and to every equation some curve, it seems likely that the
simpler the curve the simpler will be the corresponding equation,
and vice versa. In fact, the student who does not know it already
need only refer to the most elementary treatise on graphs to find
that every equation of the first degree in x and y, i.e. one which does
not involve any x², y², xy, or higher powers, represents some straight
line. Any such equation, e.g.

x − 3y + 12 = 0,

can be at once thrown into either the form

(1) x/(−12) + y/4 = 1,

where −12 and 4 are intercepts made by the line on the axes
OX and OY; or

(2) y = ⅓x + 4,

where ⅓, i.e. 1 in 3, is the measure of its gradient and 4 the height
above the origin at which it cuts the axis OY.
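The passage from the general equation to forms (1) and (2) is mechanical, and may be sketched in Python with exact fractions. The function `line_forms` and its argument names are our own illustration; it assumes the line is written as ax + by + c = 0 with neither a nor b zero, so that both intercepts exist.

```python
from fractions import Fraction


def line_forms(a, b, c):
    """For a*x + b*y + c = 0 (a and b non-zero) return the x-intercept,
    the y-intercept and the gradient as exact fractions."""
    if a == 0 or b == 0:
        raise ValueError("both intercepts exist only when a and b are non-zero")
    x_intercept = Fraction(-c, a)  # put y = 0
    y_intercept = Fraction(-c, b)  # put x = 0
    gradient = Fraction(-a, b)     # from y = (-a/b) x + (-c/b)
    return x_intercept, y_intercept, gradient


# The line x - 3y + 12 = 0 discussed in the text.
x_intercept, y_intercept, gradient = line_forms(1, -3, 12)
print(x_intercept, y_intercept, gradient)
```

For the line in the text this yields the intercepts −12 and 4 of form (1) and the gradient ⅓ of form (2).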

Further, every equation of the second degree in x and y, which
may involve x², y², and xy, but no higher powers, represents
geometrically some conic, a family of curves comprising the parabola,
the ellipse, and the hyperbola, with the circle and two straight
lines as particular cases. The earth and other planets, likewise
comets, in their journeys through space travel along curves belonging
to the same family, one of ancient and historical connections.

These conics need not, however, detain us, and we pass on at
once to an example of a cubic graph to show how a very little
knowledge of the theory may be put to some practical use. Suppose a
box manufacturer has a large number of rectangular sheets of
cardboard, 3 ft. long by 2 ft. broad, and he wishes to make open
boxes with them by cutting a square piece of the same size out of
each corner and turning up the flaps that are left. How big
should the squares be if this is to be done with as little waste as
possible? Clearly this is commercially an important type of problem
to solve.

[Figure: the sheet with a square removed from each corner. (The shaded flaps are bent upwards along the dotted lines.)]

Let us denote a side of the square to be cut out of each corner
by x feet. Then the bottom of the required box will have dimensions

(3 − 2x) ft. by (2 − 2x) ft.

and its depth will be x ft.
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Hence the capacity of the box when completed will be 

x(3 − 2x)(2 − 2x) cu. ft.,

and he makes best use of the material who produces the most
capacious box. Call this expression y and let us find the values
of y corresponding to different values of x so as to be able to draw
roughly the curve of which the equation is

y = x(3 − 2x)(2 − 2x) . . . (1)

Table (19). Table of Corresponding Values of x and y
in the Curve y = x(3 − 2x)(2 − 2x).

     x       2x     (3 − 2x)   (2 − 2x)    x(3 − 2x)(2 − 2x)       y

   − 1      − 2        5          4             − 20             − 20
   − ½      − 1        4          3             −  6             −  6
   − ¼      − ½        3½         2½            − 2 3/16         − 2.19
     0        0        3          2                0                0
   + ¼      + ½        2½         1½            + 15/16          + 0.94
   + ½      + 1        2          1             +  1             +  1
   + ¾      + 1½       1½         ½             + 9/16           + 0.56
   + 1      + 2        1          0                0                0
   + 1¼     + 2½       ½        − ½             − 5/16           − 0.31
   + 1½     + 3        0        − 1                0                0
   + 2      + 4      − 1        − 2             +  4             +  4
   + 2½     + 5      − 2        − 3             + 15             + 15

    0.2      0.4      2.6        1.6       (0.2)(2.6)(1.6)          0.83
    0.4      0.8      2.2        1.2       (0.4)(2.2)(1.2)          1.06
    0.6      1.2      1.8        0.8       (0.6)(1.8)(0.8)          0.86
    0.8      1.6      1.4        0.4       (0.8)(1.4)(0.4)          0.45

    0.38     0.76     2.24       1.24      (0.38)(2.24)(1.24)       1.055
    0.39     0.78     2.22       1.22      (0.39)(2.22)(1.22)       1.056
    0.40     0.80     2.20       1.20      (0.40)(2.20)(1.20)       1.056
    0.41     0.82     2.18       1.18      (0.41)(2.18)(1.18)       1.055

We get a tolerably good idea of the shape of the curve by plotting
the points (x, y) shown in Table (19) from x = −½ to x = +2 as in
fig. (5). It is simply a matter of practice to be able to determine
the whole curve from a few points in this way, and the greater the
number of points plotted the more accurately will it be possible
to draw the curve. It should be noticed that the points for which
y = 0 are in a sense key-points to the curve: they are readily
found by making the factors separately zero in the right-hand side
of equation (1), namely x = 0, 3 − 2x = 0, and 2 − 2x = 0, and by
plotting them first they serve as a guide to the position of points
subsequently plotted.

[Fig. (5). Horizontal axis: Length of Side of Square cut out (x), marked at 0.25, 0.50, 0.75, 1.00.]

We want to know for what value of x the capacity of the box, y,
is greatest and the preliminary plotting is enough to indicate a
maximum value for y between x = 0 and x = 1, for the curve first
rises and then falls between these two limits. In order to discover
more exactly where the maximum is located we therefore plot
in addition the points corresponding to x = 0.2, 0.4, 0.6, 0.8
respectively, and this is done on a larger scale than that used in the
first diagram because the accuracy is thereby increased (see fig. (5)
inset).

The calculations and figure suggest that the maximum required
is very near the point for which x = 0.4, so we next work out values
of y in this neighbourhood, corresponding, say, to x = 0.38, 0.39,
0.40, 0.41, with the results shown at the foot of Table (19). From
these we conclude that to a fair degree of accuracy the maximum
value of y is given by taking x = 0.395. It would be possible in
the same way to calculate more decimal places, but we have gone
far enough to make the method clear.

Hence the side of each square cut out should be of length

0.395 ft., or 4¾ in.
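The tabulation just described, narrowing the search about the largest value found, is easily mimicked in Python. The sketch below is ours, not the book's: `capacity` is an illustrative name, the step of 0.001 ft. is an arbitrary choice, and the closing check by the Calculus (setting the derivative of y = 4x³ − 10x² + 6x to zero) is a different technique from the graphical one used in the text, added only for comparison.

```python
import math


def capacity(x):
    """Capacity in cu. ft. of the box when squares of side x ft. are cut out."""
    return x * (3 - 2 * x) * (2 - 2 * x)


# Tabulate y between x = 0 and x = 1 in steps of 0.001 ft. and keep the largest.
grid = [i / 1000 for i in range(1, 1000)]
best_x = max(grid, key=capacity)

# Comparison by the Calculus: dy/dx = 12x^2 - 20x + 6 = 0, taking the smaller root.
exact_x = (20 - math.sqrt(20 ** 2 - 4 * 12 * 6)) / (2 * 12)

print(best_x, round(capacity(best_x), 3))
print(round(exact_x, 4), round(exact_x * 12, 2))  # side in feet and in inches
```

Both methods place the maximum a little below x = 0.4, in reasonable agreement with the value read from Table (19); the curve is so flat near its summit that the capacity is almost the same anywhere between 0.39 and 0.40.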

Whenever the value of one variable, y, depends upon that of
another variable, x, in such a way that when x is given y is known,
so that y may be termed a function of x, corresponding values of
x and y can be plotted (as was done in the example just discussed)
and a curve drawn by joining up the points obtained, the relation
which connects x and y being the equation of this curve. Moreover,
it is possible, by calculating enough points from the equation
and plotting them, to get the curve as accurately as we please.

In Statistics, however, we usually have to start the other way
round and reach the equation, if at all, last. We make observations
of two sets of variables, a set of x's, and a set of y's, one of which
is dependent in some way upon the other (e.g. y, the dependent
variable, might denote the number of individuals observed to have
a certain organ of length x, the independent variable) and thus
we get pairs of corresponding values like (x₁, y₁), (x₂, y₂), (x₃, y₃) . . .
We met with examples of this method of recording results in the
last chapter, and we need only repeat here that its chief virtue is
suggested in the root of the word itself: it is more graphic than a
long table of figures and, by means of it, many of the essential
features of a problem are immediately seized upon.

Now for some purposes it may be necessary to go further and 
to find what curve would best fit the points plotted, assuming they 
were numerous enough, and what equation between x and y would 
best describe the curve. But the graphs we meet in Statistics, 
bearing, for instance, upon sociological or biological problems, are 
in general much more wayward than the mathematical kind we 
have referred to in the present chapter : it is impossible to set 
down simple equations to which they can be rigidly confined, and 
when we are unable to find any relation which accurately and 
uniquely defines y as a function of x we must rest satisfied with the
most manageable equation and the best fit we can get.

In sciences such as Engineering and Physics it is often possible
to fix upon two mutually dependent variables, x and y, and to
observe enough corresponding values of each to enable us to draw
a graph which answers very closely to the true relationship between
them, so that a connecting equation can be determined; e.g. we
may plot the amount of elastic stretch, y, in a wire when different
weights, x, are hung from the end of it, and it is found that y is
directly proportional to x. If we deal in this way with some
simple figures which are amenable to our purpose it may help to
make clear the nature of the same problem in Statistics.

The following corresponding values of x and y were given in a
Board of Education Examination (1911):

x = 1.00, 1.50, 2.00, 2.30, 2.50, 2.70, 2.80;
y = 0.77, 1.05, 1.50, 1.77, 2.03, 2.25, 2.42.

Allowing for errors of observation, it was desired to test if there
was a relation between y and x of the type

y = a + bx² . . . (1)

In the first place, the shape of the curve obtained by plotting
y against x, as in fig. (6), would, to the initiated, probably suggest
a parabola, the equation of which is of type (1). In order to test
its suitability we proceed to plot y against x², or, putting x² = ξ, we
plot y against ξ. If equation (1) holds, then, in that case

y = a + bξ . . . (2)

should also hold, and this, in (ξ, y) co-ordinates, represents a straight
line. The result of plotting y against ξ should therefore be a
number of points approximately in a straight line; we say
'approximately' to allow for errors of observation in the original data.
Now from the given statistics corresponding values of ξ and y
are, since ξ = x²:

ξ = 1.00, 2.25, 4.00, 5.29, 6.25, 7.29, 7.84;
y = 0.77, 1.05, 1.50, 1.77, 2.03, 2.25, 2.42;

Fig. (6).

and the resulting graph, fig. (7), is very approximately a straight
line. To determine its equation, choose two points (not too close
together) on the line, which has been drawn so as to run as fairly
as possible through the middle of the points plotted, and, in choosing,
take points which lie at the intersections of horizontal and vertical
cross lines (the printed lines of the graph paper) if such can be
found, because their x's and y's can be read off with ease and
accuracy. Two such points are

(2.8, 1.2) and (6.0, 2.0),

and since each of these points lies on the line whose equation is

y = a + bξ

we have

1.2 = a + b(2.8)
2.0 = a + b(6.0).

Subtracting, we get

0.8 = b(3.2).

Therefore b = ¼.

Hence a = 2 − 3/2 = ½.

Thus the equation of the line is

y = ½ + ¼ξ,

i.e. 4y = ξ + 2,

and the law connecting x and y is therefore

4y = x² + 2.
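The steps of this example (squaring x, then passing a line through two points read from the graph) can be followed in a short Python sketch. The names `line_through` and `least_squares` are our own. The second function fits the line by least squares, which is not the method of the text but a substitute for the judgment of the eye in drawing the line, and is shown only for comparison.

```python
xs = [1.00, 1.50, 2.00, 2.30, 2.50, 2.70, 2.80]
ys = [0.77, 1.05, 1.50, 1.77, 2.03, 2.25, 2.42]

# Put xi = x^2, so that y = a + b*x^2 becomes the straight line y = a + b*xi.
xis = [x * x for x in xs]


def line_through(p, q):
    """Constants a, b of the line y = a + b*xi through two points (xi, y)."""
    (x1, y1), (x2, y2) = p, q
    b = (y2 - y1) / (x2 - x1)
    a = y1 - b * x1
    return a, b


def least_squares(us, vs):
    """Constants a, b of the least-squares line v = a + b*u (for comparison only)."""
    n = len(us)
    mean_u, mean_v = sum(us) / n, sum(vs) / n
    b = (sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))
         / sum((u - mean_u) ** 2 for u in us))
    a = mean_v - b * mean_u
    return a, b


# The two points read off the graph paper in the text.
a, b = line_through((2.8, 1.2), (6.0, 2.0))
a_ls, b_ls = least_squares(xis, ys)

print(round(a, 3), round(b, 3))
print(round(a_ls, 3), round(b_ls, 3))
```

The two-point line reproduces a = ½, b = ¼; the least-squares constants differ from these slightly, as is to be expected when the line has been drawn by eye through points subject to errors of observation.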


The following statistics, the result of an experiment in Physics
to verify Boyle's Law, may be treated in the same way. x is a
number proportional to the volume of a constant weight of gas in a
closed space, and y is a number proportional to its absolute pressure.
Corresponding values of x and y observed were:

x =  46.89   41.96   40.33   38.88   37.37   36.06   34.71   33.47
y =  76.32   86.38   88.93   92.36   96.09   99.61  103.51  107.61

x =  32.39   31.08   29.97   28.76   27.26   25.32   24.04
y = 111.09  116.69  120.06  126.08  131.99  142.09  149.81.

Boyle's Law states that the product xy is constant, and this may be
tested by putting ξ = 1/x and plotting y against ξ; the points obtained
should be approximately in a straight line.
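Before plotting, the constancy of the product xy may be examined directly. The Python sketch below uses the figures exactly as they stand in this copy, and it is possible that one or two of them have been misprinted or misread; the variable names and the choice of (largest − smallest)/mean as a measure of spread are our own.

```python
# Volumes (x) and pressures (y) as printed in this copy of the text.
volumes = [46.89, 41.96, 40.33, 38.88, 37.37, 36.06, 34.71, 33.47,
           32.39, 31.08, 29.97, 28.76, 27.26, 25.32, 24.04]
pressures = [76.32, 86.38, 88.93, 92.36, 96.09, 99.61, 103.51, 107.61,
             111.09, 116.69, 120.06, 126.08, 131.99, 142.09, 149.81]

# Boyle's Law: the product x*y should be (nearly) the same for every pair.
products = [x * y for x, y in zip(volumes, pressures)]
mean_product = sum(products) / len(products)
spread = (max(products) - min(products)) / mean_product

# The substitution of the text: xi = 1/x, against which y should plot as a line.
reciprocals = [1 / x for x in volumes]

print(round(mean_product, 1), round(100 * spread, 2))
```

A spread of a few per cent. or less in the products is what 'approximately in a straight line' amounts to when y is plotted against ξ.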

Now in Statistics, as we have already explained, the exact con¬ 
nection between the variables, x and y, is rarely so clear, though 
the absence of law is not so complete as it might seem at first sight. 
At this stage, however, we need not enter into the difficult question 
of curve fitting : if drawn with care and used with judgment much 
that is of value may be learnt by simple plotting and by connecting 
up the resulting points by straight lines or a freehand curve. We
shall briefly explain or illustrate by examples how graphs and 
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graphical ideas may be used to serve three distinct purposes, 
namely :— 

(1) to suggest correlation or connection between two different 

factors or events ; 

(2) to supply a basis for finding by interpolation some values of a 

variable when others are known ; 

(3) as pictorial arguments appealing to the reason through the eye. 

We reserve (2) and (3) for the next chapter and proceed at present 
with an example of (1). 

Correlation suggested by Graphical means. Consider the index
numbers, col. (2) Table (20), showing the variation from year to
year in wholesale prices between the years 1871 and 1912. It is
not an easy matter to take in satisfactorily the meaning of such a
mass of bare figures, but they are much easier to grasp when plotted
in a graph.

In this case the numbers x, representing years, and the numbers y,
representing prices, are measures of things of quite a different
character, so that it is not necessary to take the x and y units of the
same size. Moreover they need not, in a case of this kind, necessarily
vanish at the origin, but it is convenient to draw the graph
in such a way that it shall occupy the greater part of the space at
our disposal. Thus, we have roughly 80 small squares across the
breadth of our graph paper, and between 1871 and 1912 we have
roughly 40 years; we therefore take two sides of a square to 1 year
and mark off the years 1870, 1875, 1880, . . ., along an axis or
base line parallel to the breadth of the paper, as shown in fig. (8).
Again we have roughly 70 small squares in the available space
from this base line to the top of our graph paper, and the wholesale
price index numbers vary from 88.2 to 151.9, a range of 63.7;
we therefore take one side of a square to correspond to a difference
of 1 in the price index number, and mark off the prices 90, 100,
110, . . ., along an axis parallel to the length of the paper, as
shown in the figure.

We then plot points to represent the numbers in col. (2) of 
Table (20). Thus, in 1880 wholesale prices stood at 129; we there¬ 
fore travel along the width of the paper till we reach 1880 and 
then upwards until we are opposite the 129 level on the axis of 
prices, inserting a dot to mark the position. Similarly for all other 
points, and the required graph is given by joining them up in 
succession. 
Table (20). Marriage Rate and Wholesale Prices
Index Numbers.

(1) Year.
(2) Prices.
(3) Nine Years' Average of Prices.
(4) Difference between Nos. in Cols. (2) & (3).
(5) Marriage rate.
(6) Nine Years' Average of Marriage rate.
(7) Difference between Nos. in Cols. (5) & (6).

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1871 
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# e 

% « 
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a 4 

1872 

145-2 


% % 
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4 • 

• 4 

1873 
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• « 
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e a 

1874 
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• • 
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» a 

1875 
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+ M 
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1876 

137-1 
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-1-5 
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+ 3 

1877 
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+39 

157 
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- 2 

1878 

131-1 

133-8 

: -2-7 

152 

157 

- 5 

1879 

125-0 

131-5 

1 -6-5 

144 

155 

-11 

1880 

129-0 

128-6 

+ 0-5 

149 

153 

- 4 

1881 

126-6 

125-2 

+ 14 

151 

151 

a s 

1882 

127-7 

120-8 

+ 6-9 

155 

149 

+ 6 

1883 

125-9 

117-2 

+ 8-7 

155 

148 

+ 7 

1884 

1141 

114-7 

-0-6 

151 

148 

+ 3 

1885 

107-0 

111-8 

1 -4-8 

145 

149 

- 4 

1886 

101-0 

109-2 

-8-2 

142 

140 

- 7 

1887 
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-8-1 

144 
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-2-4 1 
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+2-3 
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149 

+ 6 
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106-9 


+7-0 

156 

150 

+ 6 
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101-1 
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+ 3 

1893 

09-4 

97-4 

+ 2-0 

147 
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- 6 

1894 

93-6 

96-3 

-2-8 
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- 6 

1895 

90-7 
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-4-3 

150 
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- 6 

1806 
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94-3 

-6-1 
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156 

: + 1 

1897 

90-1 

93-8 

-3-7 

160 
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+ 3 

1898 

93-2 

93-4 

-0-2 

162 

158 

+ 4 

1899 

92-2 

03-8 

-16 
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159 

+ 6 

lOOU 

100-0 


+ 5-3 

160 

159 

+ 1 

1901 

98-7 


+ 1-0 
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159 
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96-4 

96-9 

-0-6 
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158 

+ 1 
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96-9 

98-3 

-14 
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158 

- 1 

1904 

98-2 

99-5 

-1-3 

153 

156 

- 3 
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97-6 

100-0 

-2-4 
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155 

- 2 
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-0-5 
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+ 3-2 
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It is comparatively easy from this graph to trace the change 
in prices from year to year and from decade to decade : for example, 
we note that from 1873 to 1896 the tendency of prices was on the 
whole downward, and from 1896 to 1910 the tendency was upward. 
Also on the assumption—not necessarily valid—that prices have 
varied continuously, or at least consistently, during the intervals 
between the dates to which the records refer, it is possible to read 
off intermediate values from the graph: e.g. midway between 1883
and 1884 we get the figure 120 as the index number for prices. 
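
The reading-off just described is simple proportion along the straight line joining two plotted points. A minimal sketch in Python follows; the language and the function name `read_off` are conveniences assumed here, not part of the original text, and the two index numbers are those printed for 1883 and 1884 in col. (2) of Table (20).

```python
def read_off(x0, y0, x1, y1, x):
    """Value at x on the straight line joining (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Index numbers of wholesale prices for 1883 and 1884.
midway = read_off(1883, 125.9, 1884, 114.1, 1883.5)
print(round(midway, 1))  # 120.0
```

The result is only as good as the assumption that prices moved evenly between the two dates.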

On the same graph sheet we have also plotted the marriage rate 
from year to year during the same period. The numbers are given 
in col. (5) of Table (20). This rate varies from 142 to 176, a range 
of 34, and we have a range of 40 small squares at our disposal in 
plotting ; a difference of 1 in the marriage rate has therefore been 
taken to correspond to one side of a square, and the marriage rates 
140, 150, 160 . . . are accordingly marked along the axis perpen¬ 
dicular to the same base line as before, which is used again to 
measure the passage of years, but the second graph is drawn below 
the line whereas the first was drawn above it. In this way we 
are able to compare the two graphs, namely, the one registering 
the change in prices and the one registering the change in marriage 
rate from year to year. 

It is interesting to observe that the two seem to be not unconnected:
they go up and down almost at the same time, and mountains and
valleys in the one correspond roughly to mountains and valleys in
the other; in other words, there is some kind of correlation or
reciprocal relation between them. Now these mountains and
valleys are largely the result of what may be called short-time
fluctuations, and it is important to distinguish between these changes
which are transient and the more permanent or long-time changes.
In order to get rid of the former, which sometimes conceal the
latter, the following device has been adopted: noticing that the
wave period, the length of time taken for each complete up-and-down
motion, is one of about nine years, nine-yearly averages have
been taken of the figures for wholesale prices right down col. (2)
of Table (20); thus 139.3 is the average of the index numbers from
1871 to 1879 inclusive, 138.6 is the average of the numbers from
1872 to 1880 inclusive, and so on, the results being recorded in
col. (3). When the points corresponding to these numbers are
plotted we get the broken line in fig. (8) passing through the body
of the original graph of prices and indicating its general trend in
the course of years as separated from the temporary fluctuations.
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The same procedure has been followed with the marriage rate 
statistics; the nine-yearly averages are shown in col. (6) of Table (20), 
and their graph appears as a broken line passing through the body 
of the original marriage rate graph in fig. (9). 
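
The device of nine-yearly averages can be sketched in a few lines of Python. The function name `moving_average` and the series below are assumptions made for illustration only: the series is made up (a steady fall with a nine-year wave laid on top of it) and is not taken from Table (20).

```python
def moving_average(values, window=9):
    """Averages of every run of `window` consecutive values."""
    if window < 1 or window > len(values):
        raise ValueError("window must lie between 1 and len(values)")
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


# Made-up data: a fall of 2 per year plus a wave that repeats every nine years.
wave = [0, 3, 5, 3, 0, -3, -5, -3, 0]
series = [150 - 2 * t + wave[t % 9] for t in range(27)]

trend = moving_average(series, 9)
print(trend[:3])  # the wave cancels, leaving only the steady fall
```

Because the window is as long as the wave period, each average contains exactly one complete wave, which is why the fluctuation disappears from the result.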



Suppose we wish on the other hand to study the short-time
fluctuations as distinct from the "secular trend," we may do so
by forming the differences between the numbers for each year
and the corresponding nine-yearly averages, and plotting these
differences on convenient scales.
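
The forming of these differences may be sketched as follows, in Python. It is assumed here that each nine-yearly average is set against the middle year of its nine; the function name `deviations_from_trend` and the test series are hypothetical, and the series is made up rather than drawn from Table (20).

```python
def deviations_from_trend(values, window=9):
    """Each value minus the average of the `window` values centred on it."""
    if window % 2 == 0:
        raise ValueError("an odd window is needed so that it has a middle year")
    half = window // 2
    out = []
    for mid in range(half, len(values) - half):
        average = sum(values[mid - half:mid + half + 1]) / window
        out.append(values[mid] - average)  # positive if above the average
    return out


# Made-up data: a steady fall plus a nine-year wave.
wave = [0, 3, 5, 3, 0, -3, -5, -3, 0]
series = [150 - 2 * t + wave[t % 9] for t in range(27)]

fluctuations = deviations_from_trend(series, 9)
print(fluctuations[:5])
```

The first and last four years have no centred average and therefore no difference, which is why such columns are shorter than the original record.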

The numbers obtained in this way are recorded, with their proper 
signs—positive if above the average, negative if below—in cols. (4) 
and (7) of Table (20), and the graphs of these differences are drawn, 
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one below the other for comparison, on the same graph sheet 
(fig. 10). The agreement in fluctuation from the average between 
the two factors, marriage rate and prices, is more easily remarked 
now than it was in the original graphs. High prices go as a rule 
hand-in-hand with prosperous times, and such times lead to more
frequent marriages. This statement must not be taken to imply 
that when prices are high the times are always necessarily pros¬ 
perous for the community as a whole : the lie direct would be given 
to such an implication by any one who had experienced abnormal 
war conditions. 

After about 1892, while the fluctuations continue to be similar, 
a tendency appears for the marriage rate graph to reach each 
extreme point about a year in advance of the other, as though an 
increase in marriages raised prices and a decrease lowered them. 
There is no doubt that any economic change, especially if it takes 
place on a large scale, will set up a system of corresponding forces, 
sometimes in unexpected directions, actions and reactions succeed- 
ing one another at intervals like tidal waves producing each a back¬ 
wash as it breaks, but such effects, even when anticipated in theory, 
are not always easy to unravel in practice. 

The comparison we have been discussing between changes in 
prices and marriages is suggested in Sir W. H. Beveridge’s Unemploy¬ 
ment. The whole book will repay careful study, but it contains 
one particularly illuminating chapter on ‘ Cyclical Fluctuation ’ with 
a chart labelled ‘ The Pulse of the Nation,’ because of the remark¬ 
able picture it gives of the ebb and flow of the tide of national 
prosperity. It consists of a series of curves representing respec¬ 
tively :— 

(1) bank rate of discount per cent.;

(2) foreign trade as measured by imports and exports per head
of the population;

(3) percentage of trade union members not returned as
unemployed;

(4) number of marriages per 1000 of the population;

(5) number of indoor paupers per 1000 of the population;

(6) gallons of beer consumed per head of the population;

(7) nominal capital of new companies registered in pounds per
head of the population.

The interesting thing about these curves is to see the way in 
which they move in waves of varying size up and down almost 
together, showing a connection between such phenomena more 

intimate than one might at first have suspected. A note of caution
must be inserted here, however: causal connection must not be too
confidently inferred in discussing the correlation of characters
changing simultaneously with time; because two events happen
together, one is not necessarily caused by the other.

An instructive article bearing on this point appeared recently in a
periodical well known to students of social problems. It was there
stated that high positive correlation exists between birth rate and
infantile death rate: in general the two rise or fall together, whence
Neo-Malthusians argue that the way to lower a death rate is to
lower the birth rate. The writer then contrasts Bradford, the last
word in the scientific care of infants, with Roscommon, where
conditions as to wealth and child welfare are the very reverse, and
points out that Bradford has a birth rate of 13 and an infant death
rate of 135, while Roscommon has a birth rate of 45 and an infant
death rate of 35. These figures, he suggests, prove instantaneously
that the Neo-Malthusians are guilty of the commonest of all fallacies:
they confound correlation with causation.

As an exercise in plotting the reader may see whether he can 
discover any suggestion of correlation between crime and unem¬ 
ployment by comparing the following statistics, showing the number 
of indictable offences tried in the United Kingdom and the trade 
union unemployed percentages respectively from 1861 to 1905 


Table (21). Number of tried Indictable Offences and
Trade Union Unemployed Percentages (1861-1905).

Year    No. of Indictable    Trade Union
        Offences tried       Unemployed
        (in thousands)       percentages

1861    [?]                   3.7
1862    61.3                  6.0
1863    61.4                  4.7
1864    68.4                  1.9
1865    69.9                  1.8
1866    57.6                  2.6
1867    69.5                  6.3
1868    62.4                  6.7
1869    61.3                  6.9
1870    66.1                  3.7
1871    53.1                  1.6
1872    61.9                  0.9
1873    53.5                  1.2
1874    63.5                  1.7
1875    50.0                  2.4
1876    61.9                  3.7
1877    53.8                  4.7
1878    56.0                  6.8
1879    55.0                 11.4
1880    60.7                  5.5
1881    60.6                  3.5
1882    63.3                  2.3
1883    60.8                  2.6
1884    59.6                  8.1
1885    56.4                  9.3
1886    66.2                 10.2
1887    66.2                  7.6
1888    58.5                  4.9
1889    67.6                  2.1
1890    65.0                  2.1
1891    64.1                  3.6
1892    58.3                  6.3
1893    67.4                  7.6
1894    66.3                  6.9
1895    60.8                  5.8
1896    60.7                  3.3
1897    60.7                  3.3
1898    52.6                  2.8
1899    50.5                  2.0
1900    63.6                  2.5
1901    65.6                  3.3
1902    57.1                  4.0
1903    58.4                  4.7
1904    60.0                  6.0
1905    61.5                  6.0


The chief point of difficulty in plotting such graphs is the initial
one of fixing upon the most convenient scales to use, and in this
matter hints only can be given; facility will come by practice. An
examination of Table (21) shows that the data cover a period of
forty-five years which can be marked off horizontally along a base line so
as just to fit comfortably into the available space across the graph
paper. The unemployed percentages vary between 0.9 and 11.4,
giving a range of 10.5. Similarly the indictable offences recorded
(in thousands) present a range of 13.3. We might therefore very
well choose the same vertical scale for the measurement of
indictable offences and unemployment, but, in order that the graphs
may run more or less together (without exactly overlapping) for
the sake of comparison, only the unemployment zero need be taken
actually on the base line, whereas the indictable offences may have,
say, the number 50 (thousand) at that level; also it will be
convenient to show the scale for unemployment on the right side
and the scale for offences on the left side of the paper.

An example dealing with matters somewhat different is provided
by a comparison of changes from week to week in:

(1) the mean air temperature ; 

(2) the percentage of possible sunshine ; and 

(3) the rainfall. 

The following is a record of observations taken at Greenwich in
1912 [data from London Statistics, vol. xxiii.]

Table (22). Weekly Meteorological Observations
at Greenwich (1912).

Week       Mean air   Percentage   Rainfall  |  Week       Mean air   Percentage   Rainfall
ended      temp.      of possible  in        |  ended      temp.      of possible  in
           (deg. F.)  sunshine     inches    |             (deg. F.)  sunshine     inches

Jan.  6    45.7        7           0.76      |  July  6    58.7       16           [?]
     13    41.9       15           0.45      |       13    67.0       46           [?]
     20    40.2        1           0.93      |       20    65.8       44           0.04
     27    38.9        8           0.85      |       27    64.8       31           0.16
Feb.  3    30.0       21           0.02      |  Aug.  3    57.8       33           0.54
     10    39.5       15           0.52      |       10    67.6       28           1.26
     17    45.5       11           0.44      |       17    56.2       14           0.23
     24    47.4        6           0.65      |       24    57.2       24           1.27
Mar.  2    49.8       21           0.52      |       31    66.9       27           1.33
      9    44.6       31           0.79      |  Sept. 7    54.8       36           0.21
     16    45.1       16           0.19      |       14    52.4       14           0.02
     23    42.7       15           1.08      |       21    53.8       22           0.00
     30    61.0       46           0.05      |       28    61.5       59           0.02
Apr.  6    48.0       43           0.07      |  Oct.  5    48.8       36           2.30
     13    45.0       43           0.02      |       12    46.0       53           0.00
     20    50.0       50           0.00      |       19    49.8       38           0.13
     27    52.6       76           0.00      |       26    45.4       23           0.88
May   4    50.1       32           0.21      |  Nov.  2    49.1       31           0.55
     11    59.7       29           0.06      |        9    [?]         6           0.18
     18    65.2       49           0.69      |       16    [?]         3           0.17
     25    54.1       38           0.19      |       23    46.2        6           0.31
June  1    67.0       47           0.17      |       30    [?]        13           1.06
      8    64.2       35           0.99      |  Dec.  7    42.4        9           0.31
     15    68.1       48           0.39      |       14    49.0        2           0.62
     22    61.7       66           0.65      |       21    44.4       19           0.59
     29    60.2       45           0.30      |       28    48.1        8           1.22


The rainfall graph here should be drawn reversed (i.e. so that 
it goes up as the rainfall goes down in amount, and vice versa), 
because one would expect in general much rain to go with little 
sun and low temperature. 

The range of temperature during the year is 37 degrees, of
sunshine 75 per cent., and of rainfall 2.30 in. Hence the vertical
scales for these three graphs might be chosen so that, roughly,
40 units of temperature should correspond to 80 units of sunshine
and 2 units of rainfall. Also the zeros of the three variables should
be so placed, relative to the horizontal base line registering the
weeks, that the three graphs may be conveniently compared without
confusion by too closely overlapping.









CHAPTER IX 


GRAPHS (continued)

Graphical Ideas as a Basis for Interpolation. It frequently happens
in statistical records that awkward gaps occur which require to be 
filled in ; this may be due to the fact that no record has been 
made, or that it has been made with insufficient detail, or that it 
has been lost or destroyed. Cases in point arise in connection with 
returns like that of the Census which can only be undertaken every 
few years, so that if figures are wanted for any intervening year, 
as they are in very many instances, an estimate has to be made 
from the known results of the years recorded. It is imperative, for 
example, for many purposes of local or national government, to 
be able to find with a fair degree of accuracy the population of 
county boroughs and urban or rural districts at any given time, 
to know the number of workers engaged in different occupations, 
the amount of land in pasture and under various crops, the con¬ 
dition of the people as to housing, of the children as to education, 
and so on indefinitely. 

Symbolically, with the same notation as we have used before,
we conceive the statistics in tabular form, like

$$\begin{array}{ccccc} x_1, & x_2, & x_3, & \ldots & x_n \\ y_1, & y_2, & y_3, & \ldots & y_n \end{array}$$

each y denoting the frequency corresponding to the character
measured by its companion x, e.g. the x's may stand for successive
dates and the y's for the frequencies of the population of a certain
district at those dates. If it happens that one or more of the y's, in
between the first and the last recorded, are missing, the problem is
to estimate the missing values by some method of interpolation, as
it is called. Various methods of arriving at such estimates are used,
but we shall only refer to the more elementary here.

A rough way of making the estimate, but one which is often as
accurate as the data will allow, is to plot the observations, each
(x, y) being represented by a point, and connect them up, if there
be enough of them, by a smooth curve drawn freehand, P_1 P_2 P_3 ... P_n
[see fig. (11)]; to find the y proper to any other x we have then
only to draw the ordinate through the point (x, 0) and measure the
y at the point where it cuts the curve. This is a not unreasonable
principle to follow, for in effect it gives due weight to each of the
observations actually recorded, and it assumes an even course
from each one to the next, a justifiable assumption in the
absence of any evidence that some sudden discontinuity of
value has taken place.

[Fig. (11).]

If only two observations are given, represented by the points
P_1 (x_1, y_1) and P_2 (x_2, y_2), the curve connecting them is a
straight line, and the y corresponding to any other x is at once
given geometrically, as fig. (12) shows, by

$$\frac{y-y_1}{y_2-y_1}=\frac{x-x_1}{x_2-x_1},$$

or

$$y=y_1+\frac{y_2-y_1}{x_2-x_1}(x-x_1),$$

the familiar proportional relation which is employed in this simple
case.

[Fig. (12).]

Example. Given log 5.82673 = 0.7654249,
log 5.82674 = 0.7654257.

Required log 5.826736.

Here x_1 = 5.826730, y_1 = 0.7654249,
x_2 = 5.826740, y_2 = 0.7654257,
x = 5.826736.

Therefore, by means of the above relation,

$$\begin{aligned} y &= 0.7654249+\frac{0.0000008}{0.000010}(0.000006) \\ &= 0.7654249+0.00000048 \\ &= 0.7654254. \end{aligned}$$

The logarithmic curve y = log x is, of course, not a straight line,
and the value obtained for y only represents a first approximation
to the true value.
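
The proportional relation is easily put to work mechanically. The following Python sketch (the function name `interpolate_linear` is an assumption made for illustration) repeats the logarithm example with the two tabulated values and prints the library value of the logarithm beside the estimate, so that the closeness of the first approximation may be judged.

```python
import math


def interpolate_linear(x1, y1, x2, y2, x):
    """y on the straight line through (x1, y1) and (x2, y2) at the given x."""
    return y1 + (y2 - y1) * (x - x1) / (x2 - x1)


# Tabulated seven-figure logarithms on either side of the required number.
estimate = interpolate_linear(5.826730, 0.7654249,
                              5.826740, 0.7654257,
                              5.826736)

print(f"{estimate:.7f}")               # estimate by proportional parts
print(f"{math.log10(5.826736):.7f}")   # value from the library, for comparison
```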

When more than two points are given there is bound to be a 
margin of inaccuracy, more or less according to the data, intro¬ 
duced in drawing the curve. For an example of this method the 
reader may refer back to the curve on p. 67, which was used to 
determine the median and quartiles. We may, as we saw, read 
off from it the number of candidates who obtained not more than 
any stated number of marks: e.g. 300 candidates obtained not 
more than 34 marks; or we may use it the other way round and 
find the number of marks obtained by a stated number of candi¬ 
dates : e.g. 10 per cent, of the candidates got less than 17 marks. 
Such examples might be multiplied endlessly, and the method will 
be found extremely useful when a high degree of accuracy is not 
looked for. But greater confidence will be felt perhaps in such
results—though the foundation for it may be no more secure in many 
cases—if we can translate them from geometrical to algebraical 
form, if we can find, that is to say, some formula, like the simple 
proportional relation already introduced above, which will give 
one y when others are known. 

In order to make the argument as general as possible we shall
speak of x and y as variables, and we shall think of the value of y
as depending upon that of x in such a way that when x is given,
y is known or it can be estimated* (in the sense that when the
year is given the population is known or can be estimated).

Suppose

$$y=c_0+c_1x+c_2x^2+\ldots,$$

where the c's are constants to be determined, and their number
can be made to depend upon the number of known values of y
which are used in the estimate.

[* This is equivalent to assuming that y is some function of x, say y = f(x);
clearly some such assumption is necessary if any estimate from the known values
to the unknown is to be possible. Further, for simplicity we assume f(x) can
be expanded in a Maclaurin's converging series of ascending powers of x, which
simply means that we take the relation between x and y to be of the form
adopted above.]

Geometrically, the equation

$$y=c_0+c_1x+c_2x^2+\ldots+c_nx^n$$

represents a curve called a parabola of the nth order, and such
a curve could be employed (and uniquely found, for there is only one
parabola of the kind which will go through all the points) if we
based our estimate upon a knowledge of (n+1) y's corresponding
to given x's, for we could readily make it pass through the (n+1)
known points (x_0, y_0), (x_1, y_1), (x_2, y_2), ... (x_n, y_n) by choosing
the (n+1) c's so as to satisfy the (n+1) simple linear relations

$$\begin{aligned} y_0 &= c_0+c_1x_0+c_2x_0^2+\ldots+c_nx_0^n \\ y_1 &= c_0+c_1x_1+c_2x_1^2+\ldots+c_nx_1^n \\ y_2 &= c_0+c_1x_2+c_2x_2^2+\ldots+c_nx_2^n \\ &\;\;\vdots \\ y_n &= c_0+c_1x_n+c_2x_n^2+\ldots+c_nx_n^n. \end{aligned}$$

When the curve is determined, in other words when the c's are
known, we can find any other y required by substituting the
corresponding x in the equation

$$y=c_0+c_1x+c_2x^2+\ldots+c_nx^n,$$

i.e. by supposing this point (x, y) to lie on the same curve that goes
through the known points.

It is well to mention here that the parabola is by no means always 
the best curve for fitting any given statistics, and when the number 
of observations is adequate it is possible often to make a more 
satisfactory choice. Once the equation of a suitable curve has 
been determined the subsequent interpolation or calculation of y 
for any given x is not as a rule a very difficult matter. The larger 
question of curve fitting in general is reserved for a later chapter. 

Example of First Method (fitting with a parabolic curve). Let us
illustrate this process of interpolation by fitting a parabolic curve
to the following figures, extracted from Porter's The Progress of
the Nation, giving the annual cost of Poor Relief (excluding insane
and casual) at five-yearly intervals, but with the amount for the
year 1845 omitted:

Year             1835   1840   1845   1850   1855
Cost in £1000    5526   4577     ?    5395   5890
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Assuming that no extraordinary conditions prevailed in 1845 to 
cause abnormality in expenditure, let us estimate what the figure 
would be for that year judging from the given records just before 
and after. Since there are four known points in this case, we take
as the curve through them a parabola of the 3rd order, namely:

$$y=c_0+c_1x+c_2x^2+c_3x^3 \qquad (1)$$

the four known points will then just suffice to determine uniquely
the four arbitrary constants c_0, c_1, c_2, c_3. Also, since the x
class-intervals are equal, it will simplify the algebra if we measure from
the year 1845 as origin, taking five years as unit for x and £1000
as unit for y, so that we get

x = -2, -1, 0, +1, +2
y = 5526, 4577, y_0, 5395, 5890

where y_0 is the number to be determined.

Since all five points are to lie on the curve with equation as in
(1), we have by substituting in that equation

$$\begin{aligned} 5526 &= c_0-2c_1+4c_2-8c_3 \\ 4577 &= c_0-c_1+c_2-c_3 \\ 5395 &= c_0+c_1+c_2+c_3 \\ 5890 &= c_0+2c_1+4c_2+8c_3. \end{aligned}$$

Adding the first and last of these equations,

$$2c_0+8c_2=5526+5890 \qquad (2)$$

Adding the second and last but one,

$$2c_0+2c_2=4577+5395$$

or

$$8c_0+8c_2=4(4577+5395) \qquad (3)$$

Subtracting (2) from (3),

$$\begin{aligned} 6c_0 &= 4(4577+5395)-(5526+5890) \qquad (4) \\ &= 4(9972)-(11416) \\ &= 39888-11416 \\ &= 28472. \end{aligned}$$

Therefore, since putting x = 0 in (1) gives y_0 = c_0, we have
y_0 = c_0 = £4,745,000, to the nearest thousand.
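
The elimination performed above by hand may be checked by solving the four linear relations outright. The sketch below, in Python, uses exact fractions and a plain Gauss-Jordan elimination; the function name `fit_polynomial` is hypothetical, and the routine is offered as one possible way of carrying out the first method, not as the procedure of the text.

```python
from fractions import Fraction


def fit_polynomial(points):
    """Coefficients c0..cn of the parabola of order n through n+1 points."""
    n = len(points)
    # One linear relation y = c0 + c1*x + ... per known point (augmented matrix).
    rows = [[Fraction(x) ** k for k in range(n)] + [Fraction(y)]
            for x, y in points]
    for col in range(n):
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [v / rows[col][col] for v in rows[col]]
        for r in range(n):
            if r != col:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [row[-1] for row in rows]


# Poor Relief figures, with 1845 as origin and five years as the unit of x.
known = [(-2, 5526), (-1, 4577), (1, 5395), (2, 5890)]
c = fit_polynomial(known)
print(c[0], float(c[0]))  # c0 is the estimate for 1845, in £1000
```

Only c_0 is wanted here, because the missing year has been made the origin; the other constants come out of the same calculation should the curve be needed elsewhere.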

If we only wish to make use of the records for the years 1840
and 1850, the appropriate fitting curve reduces to a straight line

$$y=c_0+c_1x,$$

on which we assume the points

(-1, 4577), (0, y_0), (+1, 5395)

to lie, so that

$$\begin{aligned} 4577 &= c_0-c_1 \\ y_0 &= c_0 \\ 5395 &= c_0+c_1. \end{aligned}$$

Therefore, adding the first and last of these equations,

$$2c_0=4577+5395,$$

so that y_0 = c_0 = £4,986,000.


* Second Method (using a formula connecting the ordinates). When,
as above, the steps from each x to the next are equal, as commonly
happens in practice, it is possible to write down a simple relation
between the y's, known and unknown, without introducing the c's
at all. At bottom the method is the same as the last, inasmuch as
the elimination of the c constants by the first method really results
in the same formula for the unknown y.

Let us represent the given statistics in this case by

$$\begin{array}{ccccc} x_0, & x_0+h, & x_0+2h, & \ldots & x_0+nh \\ y_0, & y_1, & y_2, & \ldots & y_n \end{array}$$

so that, if the fitting curve be

$$y=c_0+c_1x+c_2x^2+\ldots+c_nx^n,$$

we have, by substituting the co-ordinates of the first two points
in this equation,

$$y_1=c_0+c_1(x_0+h)+c_2(x_0+h)^2+\ldots+c_n(x_0+h)^n$$

and

$$y_0=c_0+c_1x_0+c_2x_0^2+\ldots+c_nx_0^n.$$

Hence

$$y_1-y_0=c_1h+c_2(2x_0h+h^2)+\ldots+c_n(nx_0^{n-1}h+\ldots).$$

Now this result, which we call the 1st difference between the y's,
is of (n-1)th degree in x_0, so that by subtracting two of the y's
we have reduced the degree in x_0 by 1. Similarly,

$$y_2-y_1=c_1h+c_2(2x_0h+3h^2)+\ldots+c_n(nx_0^{n-1}h+\ldots).$$

Thus we get a series of differences, each with the highest
term of the (n-1)th degree in x_0. Treating them as a series of new
ordinates and forming their differences in the same way, we get
what may be called the 2nd differences between the y's, a series
of ordinates each with the highest term of degree (n-2) in x_0.
Proceeding in this way the 3rd differences between the y's are a
series of ordinates of degree (n-3) in x_0, the 4th differences are of
degree (n-4), and so on, until ultimately we reach the nth differences,
which are of zero degree in x_0, and consequently involve only h.
It follows that the nth differences must all be equal in value and
therefore, if we go one step further and write down the (n+1)th
differences, these must vanish altogether.

[* The non-mathematical reader will do well to omit the rest of this section on
interpolation.]

If the reader finds any difficulty in following the argument 
he should test it step by step for himself in the simple case of a 
parabola of the third order when it should be perfectly clear. 

The formation of the successive differences is conveniently shown 
in Table (23). 


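Table (23). Successive Differences of Ordinates.

$$\begin{array}{ll} \text{Ordinates } y: & y_0,\ y_1,\ y_2,\ y_3,\ y_4,\ y_5 \\ \text{First differences } \Delta: & y_1-y_0,\ \ y_2-y_1,\ \ y_3-y_2,\ \ y_4-y_3,\ \ y_5-y_4 \\ \text{Second differences } \Delta^2: & y_2-2y_1+y_0,\ \ y_3-2y_2+y_1,\ \ y_4-2y_3+y_2,\ \ y_5-2y_4+y_3 \\ \text{Third differences } \Delta^3: & y_3-3y_2+3y_1-y_0,\ \ y_4-3y_3+3y_2-y_1,\ \ y_5-3y_4+3y_3-y_2 \\ \text{Fourth differences } \Delta^4: & y_4-4y_3+6y_2-4y_1+y_0,\ \ y_5-4y_4+6y_3-4y_2+y_1 \\ \text{Fifth difference } \Delta^5: & y_5-5y_4+10y_3-10y_2+5y_1-y_0 \end{array}$$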

The law of formation should be apparent from this table, for it
is precisely that which we meet in the binomial expansion, e.g. the
nth difference is of type

$$y_n-ny_{n-1}+\frac{n(n-1)}{1\cdot 2}y_{n-2}-\frac{n(n-1)(n-2)}{1\cdot 2\cdot 3}y_{n-3}+\ldots+(-1)^n y_0,$$

and by equating to zero the (n+1)th difference we have the relation
required between the y's.
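
The reader who prefers to test the argument numerically may do so with the short Python sketch below. The cubic used is a hypothetical one chosen only for illustration, and the function name `difference_table` is likewise an assumption; the sketch forms the successive differences of equally spaced ordinates and compares the 4th difference with the binomial expression.

```python
from math import comb


def difference_table(ys):
    """Rows of successive differences: the ordinates, 1st differences, 2nd, ..."""
    table = [list(ys)]
    while len(table[-1]) > 1:
        previous = table[-1]
        table.append([b - a for a, b in zip(previous, previous[1:])])
    return table


# Ordinates of a made-up parabola of the third order at x = 0, 1, ..., 6.
ys = [2 * x**3 - x + 5 for x in range(7)]
table = difference_table(ys)

print(table[3])  # 3rd differences: all equal
print(table[4])  # 4th differences: all zero

# The first 4th difference written out with binomial coefficients.
binomial_form = sum((-1) ** (4 - j) * comb(4, j) * ys[j] for j in range(5))
print(binomial_form == table[4][0])
```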

Example. Let us apply this method to the 'Poor Relief' example
already considered. Since there are four known points the relation
between x and y must be of the form

$$y=c_0+c_1x+c_2x^2+c_3x^3$$

as before. Hence the 4th differences must vanish, and taking the
points in order from years 1835 to 1855 as (x_0, y_0), (x_1, y_1),
(x_2, y_2), (x_3, y_3), (x_4, y_4), we get

$$y_4-4y_3+6y_2-4y_1+y_0=0$$

as the formula connecting five y's, four known and one (y_2) unknown.
Therefore

$$\begin{aligned} 6y_2 &= 4(y_1+y_3)-(y_0+y_4) \\ &= 4(4577+5395)-(5526+5890), \end{aligned}$$

which is equivalent to equation (4) on p. 89.

Thus y_2 = £4,745,000.
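
A sketch of the second method in Python is given below. It assumes equally spaced ordinates with exactly one of them missing (marked `None`), sets the highest difference equal to zero, and solves for the gap; the function name `missing_ordinate` is hypothetical.

```python
from fractions import Fraction
from math import comb


def missing_ordinate(ys):
    """Fill the single None in `ys` by making the highest difference vanish."""
    n = len(ys) - 1
    gap = ys.index(None)

    def coefficient(j):
        # Coefficient of y_j in the nth difference: binomial, alternating in sign.
        return (-1) ** (n - j) * comb(n, j)

    known = sum(coefficient(j) * ys[j] for j in range(n + 1) if j != gap)
    return Fraction(-known, coefficient(gap))


# Cost of Poor Relief in £1000 for 1835, 1840, 1845 (missing), 1850, 1855.
estimate_1845 = missing_ordinate([5526, 4577, None, 5395, 5890])
print(estimate_1845, float(estimate_1845))
```

With five ordinates the relation used is exactly the vanishing 4th difference of the example, so the answer agrees with that found by fitting the parabola.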

Third Method (by means of advancing differences). In the last
method we employed a relation connecting y_n with all the preceding
y's, but it is possible also to express y_n in terms of y_0 and the
successive differences, which may be written Δ, Δ², Δ³, ... Δⁿ;
we have, in fact, with the notation of Table (23):

$$\Delta_0=y_1-y_0,\quad \Delta_0^2=y_2-2y_1+y_0,\quad \Delta_0^3=y_3-3y_2+3y_1-y_0,\ \ldots$$

Thus

$$\begin{aligned} y_1 &= y_0+\Delta_0, \\ y_2 &= 2y_1-y_0+\Delta_0^2 = y_0+2\Delta_0+\Delta_0^2, \\ y_3 &= 3y_2-3y_1+y_0+\Delta_0^3 \\ &= 3(y_0+2\Delta_0+\Delta_0^2)-3(y_0+\Delta_0)+y_0+\Delta_0^3 \\ &= y_0+3\Delta_0+3\Delta_0^2+\Delta_0^3, \\ y_4 &= 4y_3-6y_2+4y_1-y_0+\Delta_0^4 \\ &= 4(y_0+3\Delta_0+3\Delta_0^2+\Delta_0^3)-6(y_0+2\Delta_0+\Delta_0^2)+4(y_0+\Delta_0)-y_0+\Delta_0^4 \\ &= y_0+4\Delta_0+6\Delta_0^2+4\Delta_0^3+\Delta_0^4. \end{aligned}$$

Here again the law of formation is clear, and it is readily
established by induction that, for all positive integral values of n,

$$y_n=y_0+n\Delta_0+\frac{n(n-1)}{1\cdot 2}\Delta_0^2+\frac{n(n-1)(n-2)}{1\cdot 2\cdot 3}\Delta_0^3+\ldots \qquad (5)$$

a series which automatically comes to an end at the term Δ_0^n.

An extension of this formula is obtained by writing θ in place
of n, where 0 < θ < 1. We then get

$$y_\theta=y_0+\theta\Delta_0-\frac{\theta(1-\theta)}{1\cdot 2}\Delta_0^2+\frac{\theta(1-\theta)(2-\theta)}{1\cdot 2\cdot 3}\Delta_0^3-\ldots \qquad (6)$$

which enables us to interpolate for a y in between any two of a series
of y's corresponding to x's advancing by equal steps. This relation
is no longer identically true as was (5), for the series on the right
in (6) is unending, but its application in practice is justified when,
as the differences advance, the numbers obtained tend to grow
smaller and smaller, so that the remainder after a certain number
of terms can be treated as negligible. Unless this tendency is
realized without carrying the differences far the formula is not
very satisfactory.

To illustrate the method of procedure the following figures may
be used from Table (7), p. 25:

Table (24). Marks obtained by certain Candidates
in an Examination.

No. of Marks.         No. of        First        Second       Third
                      Candidates.   difference   difference   difference
                      y             Δ            Δ²           Δ³

Not more than 45      447            37          -16            1
Not more than 50      484            21          -15           12
Not more than 55      505             6           -3
Not more than 60      511             3
Not more than 65      514

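Suppose now we wish to know the number of candidates who
obtained a number of marks not more than 48. In that case, in
applying formula (6), we have

$$y_0=447,\quad \theta=(48-45)/(50-45)=3/5,$$

$$\Delta_0=37,\quad \Delta_0^2=-16,\quad \Delta_0^3=1,$$

and hence, up to this order of differences, the required number of
candidates is given by

$$\begin{aligned} & 447+\frac{3}{5}\cdot 37-\frac{\frac{3}{5}\cdot\frac{2}{5}}{1\cdot 2}(-16)+\frac{\frac{3}{5}\cdot\frac{2}{5}\cdot\frac{7}{5}}{1\cdot 2\cdot 3}(1) \\ &= 447+22.2+1.92+0.06 \\ &= 471, \text{ approximately.} \end{aligned}$$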
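
The same calculation may be sketched in Python. The function names `leading_differences` and `advance` are assumptions made for illustration; the figures are those of Table (24), and the series is stopped at the third difference as in the working above.

```python
def leading_differences(ys):
    """The leading 1st, 2nd, 3rd, ... differences of equally spaced ordinates."""
    leading, row = [], list(ys)
    while len(row) > 1:
        row = [b - a for a, b in zip(row, row[1:])]
        leading.append(row[0])
    return leading


def advance(y0, leading, theta, order):
    """Formula of advancing differences, stopped after `order` terms."""
    total, coefficient = y0, 1.0
    for k in range(1, order + 1):
        coefficient *= (theta - (k - 1)) / k  # theta(theta-1)...(theta-k+1)/k!
        total += coefficient * leading[k - 1]
    return total


marks = [45, 50, 55, 60, 65]
candidates = [447, 484, 505, 511, 514]

theta = (48 - marks[0]) / (marks[1] - marks[0])
estimate = advance(candidates[0], leading_differences(candidates), theta, order=3)
print(round(estimate))  # 471
```

The coefficients are built up as θ(θ-1)(θ-2).../k!, which is the same thing as the alternating form with (1-θ), (2-θ) written in the formula.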

Also, number of candidates obtaining more than 48 marks, but not
more than 50

= 484 - 471
= 13, approximately.

Fourth Method (by means of Lagrange's Formula). We shall
consider one more formula, due to the famous French mathematician
Lagrange (1736-1813), which is useful when the recorded y's
correspond to x's which advance by unequal stages.

Let the given statistics be represented as before by

(x_0, y_0), (x_1, y_1), (x_2, y_2), ... (x_n, y_n),

and consider the equation

$$\begin{aligned} y ={}& y_0\frac{(x-x_1)(x-x_2)\ldots(x-x_n)}{(x_0-x_1)(x_0-x_2)\ldots(x_0-x_n)} +y_1\frac{(x-x_0)(x-x_2)\ldots(x-x_n)}{(x_1-x_0)(x_1-x_2)\ldots(x_1-x_n)} \\ &+\ldots+y_n\frac{(x-x_0)(x-x_1)\ldots(x-x_{n-1})}{(x_n-x_0)(x_n-x_1)\ldots(x_n-x_{n-1})} \qquad (7) \end{aligned}$$

It is of the nth degree in x, and it is identically satisfied by the
(n+1) pairs of values

(x = x_0, y = y_0), (x = x_1, y = y_1), ... (x = x_n, y = y_n).

It will therefore clearly serve as the fitting curve

$$y=c_0+c_1x+c_2x^2+\ldots+c_nx^n,$$

being exactly of this type, and in order to get the y corresponding
to any other x we have only to substitute that value of x in (7).

Example. The following figures, based upon data from Porter's
The Progress of the Nation, show the age distribution of criminals
in the year 1842.

Percentage of criminals up to age 25 = 52.0 (y_0).
Percentage of criminals up to age 30 = 67.3 (y_1).
Percentage of criminals up to age 40 = 84.1 (y_2).
Percentage of criminals up to age 50 = 92.4 (y_3).

Let us employ Lagrange's formula to find the approximate
percentage of criminals up to 35 years of age, making use of the
four ordinates given, and taking x = 35. We have

$$\begin{aligned} y ={}& 52.0\,\frac{(35-30)(35-40)(35-50)}{(25-30)(25-40)(25-50)} +67.3\,\frac{(35-25)(35-40)(35-50)}{(30-25)(30-40)(30-50)} \\ &+84.1\,\frac{(35-25)(35-30)(35-50)}{(40-25)(40-30)(40-50)} +92.4\,\frac{(35-25)(35-30)(35-40)}{(50-25)(50-30)(50-40)} \\ ={}& -10.4+50.475+42.05-4.62 \\ ={}& 77.5. \end{aligned}$$
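
Lagrange's formula lends itself readily to mechanical calculation. The Python sketch below (the function name `lagrange` is an assumption made for illustration) builds each term as the ordinate multiplied by its product of ratios, and is applied to the four percentages of the example.

```python
def lagrange(points, x):
    """Value at x of the parabola through the given (x_i, y_i) points."""
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        term = yi
        for j, (xj, _) in enumerate(points):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


# Percentage of criminals up to each age (1842), at unequal steps of age.
ages = [(25, 52.0), (30, 67.3), (40, 84.1), (50, 92.4)]
estimate = lagrange(ages, 35)
print(round(estimate, 1))  # 77.5
```

Substituting one of the given ages returns the recorded percentage, which is a convenient check that the terms have been written down correctly.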
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Reasoning made Clear with the Help of Graphs or Curves. The 
graphical method not only produces an instructive picture of a 
scheme of observations, but it may also be used effectively on 
occasion to pilot one through the intricacies of economic or similar 
argument. The eye is a very ready pupil and is quick to pass on 
what it sees to the mind ; it acts, that is to say, as an ally to the 
understanding, which might get on without it, but which certainly
gets on faster with it. 

To illustrate this we shall consider the first principles of an 
interesting class of curves relating 
to supply and demand.* 

Curve of Demand. Conceive a 
smoker who buys cigarettes at 
the rate of x per day, and pays for 
them at the rate of y pence each. 

Altogether they cost him there¬ 
fore a sum of xy pence per day, 
which is conveniently measured 
by the rectangle OABC in fig. (13). 

Notice that the cost price of each single cigarette is here represented 
by the area (y x 1), while the total expenditure is represented by the 
area (yx«). 

Now let us suppose his country is at war and that the smoker, 
to put himself in a position to discourage luxuries, decides to give 

up smoking. Let us try to 
measure in terms of pence the 
cost of this great sacrifice to 
him on the first day. 

The first cigarette is probably 
the hardest to do without, and 
the desire for it is so strong 
that, if it were a mere matter 
of money and not of patriotism, 
he would be willing to give as 
many pence as are represented, 
say, by the rectangle 1-1 in 
fig. (14) in order to have it to smoke. If he went on to bargain 

[* A fuller account of these curves will be found in Cunynghame's Geometrical
Political Economy, where a rather more accurate interpretation of "surplus
value" is given, involving the introduction of subordinate curves. The
simplified statement here adopted seemed sufficient in an introductory course.
Marshall's Principles of Economics also contains many fascinating illustrations
of the use of such curves, mainly in footnotes.]


[Fig. (13). Horizontal axis: number of cigarettes bought.]

[Fig. (14). Horizontal axis: number of cigarettes bought.]
with himself in imagination, he would not be ready to offer quite
so much for the satisfaction of a second smoke soon after the
first: he would perhaps only give a number of pence represented
by the rectangle 2-2 in the figure for this second cigarette. And
if it came to a third he would offer less still, only '3-3' pence
perhaps, for the fourth '4-4' pence, and so on. The rectangles
here are of varying height, but each stands on a base of unit length.

Thus we find that the total sum he would be prepared to offer,
bargaining for cigarette after cigarette in this way, would be
represented by the sum of the rectangles 1-1, 2-2, 3-3 ... in fig. (14),
where the addition of each unit length along OX means one more
cigarette in imagination smoked, and a diminution of unit length
in an ordinate parallel to OY means a reduction of 1d. per cigarette
in the price the smoker would be prepared to pay.

But if he fell a prey to his persistent craving and actually bought 
a number of cigarettes represented by OA in the figure, each would 
cost him in the ordinary way only a number of pence represented 
by AB, say, i.e. area (ABx 1), and his total expenditure would thus 
be measured by the area of the rectangle OABC. He would get 
them, that is to say, for less than he would be prepared to give 
rather than go without them. The difference, the area of the 
rectangles making up the portion BCDE of fig. (14), represents the
measure in pence of surplus enjoyment which he would obtain free 

of charge, or it represents the 
measure of free sacrifice he 
makes if he is true to his 
patriotic principles. 

Let us now take an example on a larger scale. Imagine a small
community of people, producers and consumers, buying and selling
among themselves. Some of them are coalowners and sell coal to
the others in the open market, where competition is supposed free
and unrestricted in any way. This last condition is emphasized,
because it is seldom perfectly satisfied in the real world of commerce.

[Fig. (15).]

Just as in the previous case we may represent the number of 
cwts. of coal bought by a length OA measured along OX in fig. (15), 
and the price actually paid in shillings per cwt. by the area of a 
rectangle on unit base and of height OC along OY. Thus the 




total cost to the consumers in shillings is measured by the area of
the rectangle OABC. 

But here again we may picture the consumers during a coal 
shortage, when, rather than go without the first cwt. of coal, some 
one among them would be ready to offer for it as many shillings as 
are represented by the rectangle 1-1 in fig. (15), and for the second 
cwt. some one would be ready to offer '2-2' shillings, for the third
‘ 3-3 ’ shillings, and so on. The demand for coal could thus be 
measured in shillings by the sum of the rectangles 1-1, 2-2, 3-3 
. . . and, if OA runs into thousands of units of coal, the lengths 
0-1, 1-2, 2-3 . . . along OX, corresponding to additions of 1 cwt. 
in the quantity bought, would in the limit be so small that the 
sum of the rectangles would become practically equivalent to the 
curvilinear area OAED in the figure, where DE is a curve drawn 
through the summits of the rectangles, namely the curve of demand.

The consumers' surplus in this case would be measured in shillings
by the area BCDE, this being the difference between the measures
of the sum actually paid for the coal bought and the sum consumers
would have been willing to pay rather than go without it.
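How closely the sum of unit rectangles approaches the curvilinear area may be seen in a small numerical sketch. The demand curve below (a straight line falling from 20s. by 0.01s. per cwt.) and the quantity of 1000 cwts. are purely hypothetical figures chosen for illustration; they are not taken from any of the diagrams.

```python
def demand_price(q):
    """Hypothetical demand curve: shillings someone would offer for the q-th cwt."""
    return 20.0 - 0.01 * q


def consumers_surplus(quantity, price):
    """Sum of the unit rectangles 1-1, 2-2, ... less the sum actually paid."""
    offered = sum(demand_price(k) for k in range(1, quantity + 1))
    paid = price * quantity            # area of the rectangle OABC
    return offered - paid              # area BCDE, built up from rectangles


quantity = 1000                        # OA, in cwts. (assumed)
price = demand_price(quantity)         # AB, the price actually paid per cwt.
surplus = consumers_surplus(quantity, price)
triangle_area = 0.5 * quantity * (demand_price(0) - price)  # exact area under the line

print(surplus, triangle_area)
```

With a thousand unit steps the rectangles already differ from the exact area under the line by only five shillings in five thousand, which is the sense in which the rectangles become "practically equivalent" to the curvilinear area.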

Curve of Supply. Now let us consider the question from the
point of view of the coalowners. We shall assume that the average
cost of production per cwt. of coal increases steadily as the
number of cwts. produced increases; this would not be an
unreasonable assumption in most cases after passing a certain point,
since the richer coal measures known are likely to be mined
before the poorer ones, and the cost of mining near the surface
is bound to be less than when deep shafts have to be bored.

[Fig. (16). Horizontal axis: number of cwts. of coal sold.]


If, then, OA, fig. (16), represents the number of cwts. of coal
sold, and if the price in shillings per cwt. at which it is sold is
denoted by the area of a rectangle on unit base and of height OC
along OY, the total payment received by the coalowners will be
measured in shillings by the area of the rectangle OABC.

But the cost of producing the first cwt. is perhaps measured
by the rectangle 1-1, that of producing the second cwt. by the
rectangle 2-2, the third by the rectangle 3-3, and so on, each rectangle
being drawn on unit base representing an advance of 1 cwt. (The
advance in the cost of production would not in reality be measured
by so much the cwt. of course, but the assumption is inaccurate
in degree only, not in principle, and, by making it, the argument
is rendered clearer.) Thus the actual cost of production is, in the
limit when OA is very large and divided up into relatively very
small parts, measured in shillings by the curvilinear area OAED,
where DE is a curve drawn through the summits of the rectangles,
namely, the curve of supply.

The difference, BCDE, between the areas OABC and OAED 
represents what is known as producers' surplus, for it measures the 
profit made by the owners in selling the coal at a higher price than
the cost price of production. 

Now let us combine the curve of supply (S.C.) and the curve of
demand (D.C.) in the same figure, fig. (17). Their meeting point
P determines the number of cwts. of coal bought (x), and the selling
price in shillings per cwt. (y). For it is clear that under normal
conditions it would not be profitable to coal producers to pass this
point, because beyond it the demand on the part of coal consumers
measured in money is less than the cost of production; they are
not willing on the average to pay so much as y s. per cwt. for it,
and it costs more than y s. per cwt. on the average to produce. If,
on the other hand, the amount of coal produced decreases below
x cwts., the greater this decrease the higher does the profit become
on the sale of it, because the greater is the difference between the
cost price and the selling price; hence, as profits become more
pronounced, recruits will be attracted into the coal-producing
business, and, if this goes on, deeper shafts will have to be bored
and poorer fields worked until profits begin to decrease again and
the supply once more approaches x cwts. Thus sooner or later
the production of coal and its market price will tend to the level
determined by the equilibrium point P where the supply and
demand curves meet.

Endless varieties of problems may be discussed by altering the 
conditions and observing the effect produced in the standard 
diagram. Three examples will suffice to illustrate the method. 



[Fig. (17). Horizontal axis: number of cwts. of coal bought or sold.
S.C. = supply curve; D.C. = demand curve.]
1. Effect of a Change in Normal Demand. Here we suppose the
normal conditions of supply are unaltered—it costs just as much 
as before to produce the same amount of the commodity in question ; 
but a more eager demand on the part of consumers shows itself in a 
readiness to purchase more at any given price than would have 
been purchased under the old conditions: this may conceivably 
be due to a general increase in the purchasing power of these con¬ 
sumers, or it may be the result of a shortage of some other com¬ 
modity which causes this one to be more widely used, just as 
margarine, for instance, has been known to take the place of butter ; 
whatever the reason may be, the effect is that the demand curve 
now occupies a higher level throughout its length, D'.C'. in place of
D.C. in the figures.

When we turn to the supply side of the question, there are three
stages which, although they shade into one another in practice, it
is well to separate clearly in theory: (1) the only supplies
immediately available are those actually in the hands of dealers; (2) to
meet the increased demand, and so earn for themselves increased
profits, manufacturers will speed up production, by working
overtime, etc., with the help possibly of any disengaged labour or
capital they may be able to secure, and the resulting extra supplies
will be available after a short time; (3) if the demand continues
unabated, manufacturers, by offering higher wages and interest,
will seek to attract fresh labour and capital from other engagements
into their business, and, by renewing their machinery and generally
improving their organization, they will produce on a larger and
relatively more economical scale. Moreover, other manufacturers,
seeing the profits to be earned, will be attracted into the same line
of business also, so that by this time the current available supplies
of the commodity may exceed very appreciably their old figure.
But all this happens only in the long run, and the economist has
always to bear this extremely important element of time carefully
in mind when he seeks to estimate the effects of any proposed
action.

We assume then that the new demand remains long enough at 
its higher level to allow for the gradual adjustment in this way of 

supply to the changed 
conditions, and for the 
economic forces called into 
play once again to arrive 
at a balance between 
them, most likely at a 
new equilibrium point. 
The two figures illustrate 
the difference in effect 
according as the produc¬ 
tion of the commodity is 
subject to a decreasing or 
an increasing return, i.e. according as the cost of production rises 
or falls when the amount produced is increased. In both cases it 
will be noted that more of the commodity is produced (ON' in place 
of ON) in answer to the keener demand, but the difference is much 
greater in the second case than in the first. Also the price has 
gone up in the first case, while in the second it has gone down,
the difference being measured by the change in PN. 

2. Effect of a Tax. If the tax is at the rate of so much per unit
(say 1s. per unit, if the price is measured in shillings) of the
commodity produced, this will raise the supply curve, S.C., bodily up
a distance of 1 unit into the position S'.C'., fig. (18), because the
effect is the same as if 1s. were added to the cost of each unit in
production. The production will thus be diminished by N'N units,
for P' is the new equilibrium point; the selling price will be
increased by P'M s. per unit (by less, it should be noted, than P'Q or
K'K, the amount of the tax); producers' surplus, which is analogous
to what economists term rent, is diminished by (area KPL − area
K'P'L') s.; consumers' surplus is diminished by (area PLL'P') s.;
finally, the tax produces for the Treasury a number of shillings
represented by a rectangle with sides of length ON' and KK'.
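The effect of the tax may be traced numerically when both curves are taken, for simplicity, to be straight lines. The sketch below rests on that assumption, and the coefficients (demand falling from 20s., supply rising from 5s., a tax of 1.5s. per unit) are hypothetical; the function `equilibrium` is our own illustration, not a standard routine.

```python
def equilibrium(a, b, c, d, tax=0.0):
    """Meeting point of two straight-line curves.

    Demand curve: price = a - b * quantity.
    Supply curve: price = c + tax + d * quantity (raised bodily by the tax).
    """
    quantity = (a - c - tax) / (b + d)
    price = a - b * quantity
    return quantity, price


a, b = 20.0, 0.01     # assumed demand curve
c, d = 5.0, 0.005     # assumed supply curve
tax = 1.5             # shillings per unit (assumed)

q_before, p_before = equilibrium(a, b, c, d)        # the point P
q_after, p_after = equilibrium(a, b, c, d, tax)     # the point P'

fall_in_production = q_before - q_after   # N'N
rise_in_price = p_after - p_before        # less than the tax itself
treasury_receipts = tax * q_after         # rectangle with sides ON' and KK'

print(fall_in_production, rise_in_price, treasury_receipts)
```

With straight-line curves the rise in price is the fraction b/(b+d) of the tax, so it falls short of the tax whenever the supply curve slopes upward at all.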

3. Effect of a Monopoly. A monopolist has the power to stop
production short of the true equilibrium point, so that ON' cwts.,
fig. (19), are produced in place of the ON cwts. which free
competition would demand. The selling price is thus raised by Q'S s. per
cwt.; producers' surplus is increased by (area KP'Q'M' − area
KPL) s.; while consumers' surplus is diminished by (area PLD − area
DM'Q') s.

A word of explanation is neces¬ 
sary before leaving the subject of 
these supply and demand curves. 

It is probable that the reader will 
have questioned the possibility of 
drawing such curves for any com¬ 
modity with sufficient accuracy to 
be of any value, but it would be 
enough as a rule to be able to estimate what would happen 
if a slight variation occurred in price or in production, and such 
an estimate may sometimes be made by actual trial : e.g. a good 
practical farmer most likely knows nothing about supply and 
demand curves as such, yet from past experience he has a pretty 
shrewd notion as to how far it may be profitable to spend an extra 
pound here in rearing calves and a pound less there in cultivating 
crops, bearing in mind the prices which cattle and corn might be
expected to fetch. From his point of view the interest of the 
curves, if he know anything of them, would be centred in those 
portions which correspond to normal conditions, i.e. somewhere in
the neighbourhood of the equilibrium point under the free play of 
ordinary competition. 

Their real value, however, as suggested at the beginning, does 
not consist in the practical assistance which they afford to the pro¬ 
ducer or consumer, by way of foretelling the actual measure of 
consumption or production, so much as in the light they throw 
upon general tendencies which are rather apt to be obscured if they 
are ponderously presented with elaborate economic argument. 
They make plain in a moment to the eye what can only be stated 
in two or three pages of writing. 
CORRELATION

One of the most important questions which can be discussed by
statistical methods is that of possible connection, or correlation, as 
it is called, between two sets of phenomena. If some factor in 
each can be isolated and measured numerically, our object is to 
discover if the size of either is sympathetically affected when a 
change occurs in the size of the other ; or, to put the matter in 
another way, do large values of the one factor go with large values 
of the other factor and small with small, or vice versa ? And, if 
some mutual dependence of this kind exists, can an estimate of 
its extent be made?

Consider, for example, the factor or character of height in husband
and wife. Is there any connection between stature of husband (x)
and stature of wife (y)? Do tall men tend on the average to wed
tall women, or do we find tall men choosing short women for wives
just about as often as they choose tall women? When correlation
exists we shall want some measure for it which will tell us
the amount of change or devia¬ 
tion from the average in either 
character associated with a given 
change or deviation from the 
average in the other. 

In studying graphs we saw how some hint of the existence of
correlation might be discovered, but we wish now to go a little
more deeply into the subject. The first step is to measure an
adequate number of pairs of values, x and y, of the characters
concerned in order to find what values are associated together,
and how frequently the same values are repeated. When this is
done we can draw up a table of double entry, see fig. (20), setting
out in rows and columns the frequencies observed. An examination
of Table (25), showing the variation of brain weight with age
in the case of 197 Bohemian women, will make clear what is meant.
The x's and the y's, taken in order, are supposed to ascend in
magnitude, and when, for example, the pair of values ($x_2$, $y_3$)
is observed to be repeated nine times, the number 9 is placed
in the second column and third row of the table, so that the frequency
of each class is found recorded in the square proper to it: thus,
out of the sample in Table (25), there are 10 women between the
ages of 40 and 50 with brains weighing between 1300 and 1400
grams.


Table (25). Variation of Brain Weight with Age in the
Case of certain Bohemian Women.

[Data from Biometrika, vol. iv. pp. 13 et seq., Variation and Correlation
in Brain Weight, by Raymond Pearl.]

                                  Age in years (x)
Brain-weight       20-30   30-40   40-50   50-60   60-70   70-80
in grams (y)        x1      x2      x3      x4      x5      x6    Totals

1000-1100   y1       1       -       1       1       -       -       3
1100-1200   y2       2       2       4       2       5       4      19
1200-1300   y3      28       9       8      14      10       4      73
1300-1400   y4      26      14      10       6       5       4      65
1400-1500   y5      13       7       7       2       -       2      31
1500-1600   y6       2       3       -       1       -       -       6

Totals              72      35      30      26      20      14     197

Mean y            1325    1350    1310    1285    1250    1279


When each class interval, as in this table, includes a small range
of values, the x and y may, as an approximation, be taken as the
mid values of their class intervals; $y_3$ would be taken, for instance,
as 1250, though it really includes all values between 1200 and
1300 grams. Strictly in such cases each single observation is not,
geometrically speaking, located at a definite point, but lies
somewhere within a small area, though it is treated as if it had the values
x and y which apply to the centre point of the area. It is sometimes
possible to correct for this assumption by what is known as
Sheppard's adjustment, but we shall not concern ourselves with
the correction in the present discussion, so as to avoid complications,
because the difference made is not generally large.

The table, when drawn up, may immediately suggest some 
intimate connection between x and y. It may indicate that as 
X increases y also in general increases, or that y tends to fall in 
value as x grows bigger. But a more refined analysis is necessary. 
It would be instructive perhaps to travel along the row of x's,
finding what mean value of y is associated with $x_1$, what mean value
of y is associated with $x_2$, and so on. This would give a sounder
basis for judging whether, as x increased, y in general increased or 
decreased as the case might be : for example, in Table (25) the 
mean values of y associated with the several types of x are shown 
in their proper columns at the foot of the table and clearly, as 
X increases, y tends to decrease, apart from conflicting readings at 
the beginning and end of the table, and the latter of these may not 
be significant of any real difference in brain weight at the end of
life, for it is only based on fourteen observations ; generally, the 
inference from this table would be that the weight of the brain 
decreases as the age increases after maturity is once reached, 
although, of course, it would be rash to make more than a tentative 
statement with so small a sample at our disposal. 
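The means at the foot of Table (25) can be recomputed directly from the frequencies, taking each brain-weight class at its mid value. In the sketch below the frequencies are our reading of the printed table with blanks taken as zero, checked only against the table's own totals; the variable and function names are our own.

```python
# Mid values of the brain-weight classes 1000-1100, ..., 1500-1600 grams.
weight_mids = [1050, 1150, 1250, 1350, 1450, 1550]

# Rows: brain-weight classes; columns: ages 20-30, 30-40, ..., 70-80.
# Frequencies as read from Table (25); a blank square is entered as 0.
frequencies = [
    [1, 0, 1, 1, 0, 0],
    [2, 2, 4, 2, 5, 4],
    [28, 9, 8, 14, 10, 4],
    [26, 14, 10, 6, 5, 4],
    [13, 7, 7, 2, 0, 2],
    [2, 3, 0, 1, 0, 0],
]


def column_totals(table):
    """Number of observations in each array of y's (one per age class)."""
    return [sum(row[col] for row in table) for col in range(len(table[0]))]


def column_means(table, mids):
    """Mean y of each array, every observation taken at its class mid value."""
    totals = column_totals(table)
    means = []
    for col, n in enumerate(totals):
        weighted = sum(row[col] * mid for row, mid in zip(table, mids))
        means.append(weighted / n)
    return means


totals = column_totals(frequencies)
means = column_means(frequencies, weight_mids)
print(totals)
print([round(m) for m in means])
```

The rise between the first two columns and again between the last two are the "conflicting readings" at the beginning and end of the table; elsewhere the mean falls steadily with age.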

Let us suppose $\bar{y}_1$ to be the mean value of y associated with $x_1$,
$\bar{y}_2$ the mean value of y associated with $x_2$, $\bar{y}_3$ with $x_3$, and so on.
If these values ($x_1$, $\bar{y}_1$), ($x_2$, $\bar{y}_2$), ($x_3$, $\bar{y}_3$), etc., are plotted, it is very
often found that they cluster more or less closely about a straight
line, see fig. (21), so that we are led to ask whether there is not
some line which will very fairly describe the run of the points;
the equation of such a line would be

$$\bar{y} = mx + c,$$

and if m and c were known we could find from this equation the best
average value of y corresponding to any given x.

But, on reflection, $\bar{y}_1$, $\bar{y}_2$, $\bar{y}_3$ . . . are themselves only the best
y's corresponding to the particular values $x_1$, $x_2$, $x_3$ . . . of x, so
that the problem is really the same as that of finding the relation

$$y = mx + c,$$

based on all the observations, which will enable us to estimate the
best y corresponding to any given x.

Now for any value $x_1$ of x the value of y given by this relation
is ($mx_1 + c$), while by observation we may find more than one value
of y corresponding to the value $x_1$ of x. If $y_1$ be one such value
the difference between it and the value given by the above relation is

$$(mx_1 + c - y_1).$$

This difference we may regard as the error made in estimating y
from the relation instead of taking the value given by observation,
which for the moment we think of as the true value. The best
relation will then clearly be the one which makes all such errors of 
estimate as small as possible. But, algebraically, some of these 
errors are positive, i.e. the value of y given by the relation is greater 
than that given by observation, and some are negative, and it is 
only their magnitudes that we wish to take into account. Accord¬ 
ingly we follow the method used in finding the standard deviation 
in order to get rid of the ambiguities of sign : we form, that is to 
say, the sum of the squares of the errors, because the expression so 
formed will clearly be least when each separate error is as small as 
possible in absolute magnitude. 
To find, then, the values of m and c which will make

$$(mx_1 + c - y_1)^2 + \dots + (mx_n + c - y_n)^2$$

a minimum (see Part II, p. 271, Note 7), where n is the total
number of pairs of observations.

The required values are given by differentiating, first with regard
to c treating m as constant, and then with regard to m treating c
as constant, putting each result equal to zero. Thus

$$(mx_1 + c - y_1) + \dots + (mx_n + c - y_n) = 0,$$
$$(mx_1 + c - y_1)x_1 + \dots + (mx_n + c - y_n)x_n = 0.$$

Therefore

$$m(x_1 + \dots + x_n) + nc - (y_1 + \dots + y_n) = 0,$$
$$m(x_1^2 + \dots + x_n^2) + c(x_1 + \dots + x_n) - (x_1y_1 + \dots + x_ny_n) = 0.$$

The first of these equations gives

$$m(n\bar{x}) + nc - (n\bar{y}) = 0,$$

i.e.

$$m\bar{x} + c - \bar{y} = 0,$$

where $\bar{x}$ is the mean of all the x's and $\bar{y}$ is the mean of all the y's,
and it expresses the fact that the line y = mx + c passes through
the point ($\bar{x}$, $\bar{y}$).

This might have been expected, for, graphically, each pair of
observations ($x_1$, $y_1$), ($x_2$, $y_2$), ($x_3$, $y_3$) . . . corresponds to some point,
and if we look for the line y = mx + c passing through the region
where they cluster most thickly together we should certainly expect
it to pass through their mean or centre of gravity ($\bar{x}$, $\bar{y}$). This
suggests how the values of m and c may be considerably simplified.
If we measure all the x's from $\bar{x}$, their mean, and all the y's from
$\bar{y}$, their mean, which is equivalent to taking the point ($\bar{x}$, $\bar{y}$) as origin
and replacing every x by its deviation $\xi$ from $\bar{x}$ and every y by
its deviation $\eta$ from $\bar{y}$, the first of the above relations is reduced
to c = 0, and therefore the second becomes

$$m(\xi_1^2 + \dots + \xi_n^2) - (\xi_1\eta_1 + \dots + \xi_n\eta_n) = 0.$$

Hence

$$m = \frac{\xi_1\eta_1 + \dots + \xi_n\eta_n}{\xi_1^2 + \dots + \xi_n^2} = \frac{np}{n\sigma_x^2} = \frac{p}{\sigma_x^2},$$

where p is the mean of all the product pairs ($\xi\eta$) and $\sigma_x$ is the standard
deviation of all the x's.
Thus the required equation for estimating the best $\eta$ corresponding
to any particular $\xi$ is

$$\eta = \frac{p}{\sigma_x^2}\,\xi,$$

whence

$$y - \bar{y} = \frac{p}{\sigma_x^2}(x - \bar{x}) \qquad (1)$$

The coefficient $p/\sigma_x^2$ in this equation evidently gives the deviation
in y from the mean y corresponding to unit deviation in x from
the mean x, for when $(x - \bar{x}) = 1$, $(y - \bar{y}) = p/\sigma_x^2$. Hence the greater
this coefficient is, the greater will be the change in y resulting from,
or at all events coexistent with, unit change in x.
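The rule just obtained, that the slope is the mean product of the deviations divided by the squared standard deviation of the x's and that the line passes through the mean point, may be tried on a few numbers. The five pairs of observations below are invented solely for illustration, and the function name `fit_line` is our own.

```python
def fit_line(xs, ys):
    """Least-squares line y = m*x + c, with m = p / sigma_x**2 and the line
    passing through the mean point (mean_x, mean_y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    xi = [x - mean_x for x in xs]        # deviations of x from its mean
    eta = [y - mean_y for y in ys]       # deviations of y from its mean
    p = sum(a * b for a, b in zip(xi, eta)) / n   # mean product of deviations
    var_x = sum(a * a for a in xi) / n            # sigma_x squared
    m = p / var_x
    c = mean_y - m * mean_x
    return m, c


xs = [1, 2, 3, 4, 5]      # invented observations
ys = [2, 4, 5, 4, 5]

m, c = fit_line(xs, ys)

# The two conditions obtained by differentiation should both be satisfied.
errors = [m * x + c - y for x, y in zip(xs, ys)]
sum_errors = sum(errors)
sum_errors_times_x = sum(e * x for e, x in zip(errors, xs))

print(m, c)
```

Both sums of errors come out as zero to within rounding, which is the numerical counterpart of the two equations obtained above by differentiation.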

Thus $p/\sigma_x^2$ would seem to supply a not unreasonable measure of
the correlation between x and y. But there is something very
unsymmetrical about this result. Why should the correlation be
measured by $p/\sigma_x^2$ any more than by $p/\sigma_y^2$? In fact, we might
repeat the whole of the previous argument, interchanging x and y
throughout wherever they appear. In that case we should first
travel down the column of y's and calculate the mean values of x
associated with $y_1$, $y_2$, $y_3$ . . . respectively. This would give a set
of points ($\bar{x}_1$, $y_1$), ($\bar{x}_2$, $y_2$), ($\bar{x}_3$, $y_3$), . . . which, when plotted, would
perhaps lie approximately in a straight line. We should thus be
led to look for some relation

$$x = m'y + c'$$

which would enable us to estimate the best average x corresponding
to a y of given type, and, proceeding just as before, we should
ultimately obtain the equation

$$\xi = \frac{p}{\sigma_y^2}\,\eta,$$

or

$$(x - \bar{x}) = \frac{p}{\sigma_y^2}(y - \bar{y}) \qquad (2)$$

in which the coefficient $p/\sigma_y^2$ gives now the deviation in x from the
mean x corresponding to unit deviation in y from the mean y.

Hence $p/\sigma_y^2$ has, seemingly, just as much claim as $p/\sigma_x^2$ to measure
the correlation between x and y. The one gives the change in x
corresponding to unit change in y: the other gives the change in y
corresponding to unit change in x; and the only reason why they
differ is because unit change in x does not mean the same thing as
unit change in y: their standards of changeableness or variability
are not equal. If then we could alter the scales of measurement
so that unit change in each were of the same magnitude, the two
coefficients obtained ought to become identical, and we should then
have a really satisfactory measure for the correlation required.
With this object let us examine the variability of the x's and
compare it with the variability of the y's. Now the total dispersion
of the different x's on either side of $\bar{x}$, the mean x, is conveniently
measured by $\sigma_x$, their standard deviation. And similarly the
dispersion of the y's on either side of $\bar{y}$, the mean y, is measured
by $\sigma_y$. The bigger $\sigma_x$ is, the greater is the variability of the x's,
and the bigger $\sigma_y$ is, the greater is the variability of the y's. Hence,
in equations (1) and (2), $(x - \bar{x})$ should be divided by $\sigma_x$ and $(y - \bar{y})$
by $\sigma_y$ if we want to work with the same unit of change or variability
in each case. The equations then become

$$\left(\frac{y - \bar{y}}{\sigma_y}\right) = \frac{p}{\sigma_x\sigma_y}\left(\frac{x - \bar{x}}{\sigma_x}\right)$$

and

$$\left(\frac{x - \bar{x}}{\sigma_x}\right) = \frac{p}{\sigma_x\sigma_y}\left(\frac{y - \bar{y}}{\sigma_y}\right).$$

Write $r = p/\sigma_x\sigma_y$; then r is taken to be the coefficient of correlation,
for it measures the change in either character corresponding to
unit change in the other when the units are made comparable.
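The coefficient r and the two unsymmetrical coefficients from which it was built may be computed side by side. The sketch below uses five invented pairs of observations; the function name `correlation_summary` and the dictionary keys are our own choices.

```python
from math import sqrt


def correlation_summary(xs, ys):
    """Return p/sigma_x^2, p/sigma_y^2 and r = p/(sigma_x * sigma_y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    xi = [x - mean_x for x in xs]
    eta = [y - mean_y for y in ys]
    p = sum(a * b for a, b in zip(xi, eta)) / n     # mean product of deviations
    sigma_x = sqrt(sum(a * a for a in xi) / n)      # S.D. of the x's
    sigma_y = sqrt(sum(b * b for b in eta) / n)     # S.D. of the y's
    return {
        "y_on_x": p / sigma_x ** 2,    # change in y per unit change in x
        "x_on_y": p / sigma_y ** 2,    # change in x per unit change in y
        "r": p / (sigma_x * sigma_y),  # the same once the units are made comparable
    }


xs = [1, 2, 3, 4, 5]      # invented observations
ys = [2, 4, 5, 4, 5]

summary = correlation_summary(xs, ys)
print(summary)
```

The product of the two unsymmetrical coefficients is always the square of r, so that r is their geometric mean, taken with their common sign.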

The lines giving the best y for a given x and the best x for a
given y may now be written

$$(y - \bar{y}) = r\frac{\sigma_y}{\sigma_x}(x - \bar{x}) \quad\text{and}\quad (x - \bar{x}) = r\frac{\sigma_x}{\sigma_y}(y - \bar{y}),$$

and they are called lines of regression. The term regression was
first used by Sir Francis Galton in a paper entitled Regression
towards Mediocrity in Hereditary Stature, though the root idea
is not by any means confined to characters affected by heredity:
it holds for any pair of correlated variables. Galton found that
if a number of tall fathers are selected and their heights measured,
the mean height being calculated, and if, further, the heights of the
sons of these fathers are measured, their mean height being likewise
calculated, the latter is not equal to the mean height of the
selected fathers, but is rather nearer the mean height of the
population as a whole. There is, that is to say, a regression or stepping
back of the variable towards the general average. Professor Karl
Pearson has remarked that 'in the existing state of our knowledge
the recognition that the true method of approaching the problem
of heredity is from the statistical side, and that the most we can
hope at present to do is to give the probable character of the offspring
of a given ancestry, is one of the great services of Francis Galton to
Biometry.'

The expressions $r\sigma_y/\sigma_x$ and $r\sigma_x/\sigma_y$ are called coefficients of regression,
and they register in the above particular case the amount of
abnormality to be expected in the height of the sons when the amount of
abnormality in the height of the fathers is known, and vice versa.
The regression of the sons' height, y, on the fathers' height, x, is,
in fact, defined as the ratio of the average deviation of the heights
of the sons from the mean height of all sons to the deviation of the
heights of the fathers from the mean height of all fathers, and hence
it may be written

$$\text{Mean}\,(y - \bar{y})/(x - \bar{x}) = r\sigma_y/\sigma_x.$$

To make the definition more general, instead of speaking merely in
terms of height, we refer to any row or column (for there is no
intrinsic difference between row and column) in a table like
Table (25) as an array of y's or of x's, and selecting a particular
type, say a particular value of x (like fathers of height x), we define
the regression of the corresponding array of y's (like heights of sons
of these fathers) on the type x to be the ratio of the average
deviation of the array of y's from the mean y to the deviation of the
selected type x from the mean x.

Example. To illustrate, let us take some figures due to Professor
Pearson and Dr. Alice Lee [Biometrika, vol. ii. pp. 357 et seq., On
the Laws of Inheritance in Man]. Suppose the mean stature of all
observed fathers, based on a sample of over 1000 observations
= 67.68 in., with S.D. = 2.70 in.

Also suppose the mean stature of all sons = 68.65 in., with S.D.
= 2.71 in., and that the correlation r between stature of father
and stature of son = 0.514.

The regression of son on father as regards stature is then given by

$$(y - 68.65) = (0.514)\frac{2.71}{2.70}(x - 67.68),$$

where x is the height of selected fathers and y the mean height of
their sons, i.e.

$$y = 0.516x + 33.73,$$

so that if we selected fathers of height 70 in., for example, the
mean height of their sons would not be 70 in., but

$$(0.516)(70) + 33.73 = 69.85 \text{ in.},$$

i.e. there is a regression towards the general mean, 68.65 in., of
all sons.

Also the coefficient of regression

$$= (0.514)(2.71)/(2.70) = 0.516.$$
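The figures of this example are readily verified. The sketch below uses only the five quantities quoted from Pearson and Lee; the names of the variables and of the function `mean_height_of_sons` are our own.

```python
mean_father, sd_father = 67.68, 2.70   # inches
mean_son, sd_son = 68.65, 2.71         # inches
r = 0.514                              # correlation of father's and son's stature

# Coefficient of regression of son on father: r * sigma_y / sigma_x.
slope = r * sd_son / sd_father
# The regression line passes through the point of means.
intercept = mean_son - slope * mean_father


def mean_height_of_sons(father_height):
    """Best estimate of the mean height of sons of fathers of the given height."""
    return intercept + slope * father_height


estimate_for_70 = mean_height_of_sons(70.0)
print(round(slope, 3), round(intercept, 2), round(estimate_for_70, 2))
```

Fathers standing 2.32 in. above their own mean thus have sons who stand on the average only about 1.20 in. above the mean of all sons, which is the regression in question.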


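It is not difficult to show that the greatest numerical value r
can in general take is unity, for consider the expression for the
sum of the squares of the differences between the observed
deviations of the y characters from their mean and the corresponding
deviations as deduced from the best fitting regression line,

$$y - \bar{y} = r\frac{\sigma_y}{\sigma_x}(x - \bar{x}).$$

If, with our previous notation, $\eta_1$ denote the observed deviation of
the one character, y, associated with a particular deviation, $\xi_1$, of
the other character, x, then, since $r(\sigma_y/\sigma_x)\xi_1$ denotes the best value
given by the line, the sum of the squares of the differences between
these values

$$= \left(\eta_1 - r\frac{\sigma_y}{\sigma_x}\xi_1\right)^2 + \dots + \left(\eta_n - r\frac{\sigma_y}{\sigma_x}\xi_n\right)^2$$

$$= (\eta_1^2 + \dots + \eta_n^2) - 2r\frac{\sigma_y}{\sigma_x}(\xi_1\eta_1 + \dots + \xi_n\eta_n) + r^2\frac{\sigma_y^2}{\sigma_x^2}(\xi_1^2 + \dots + \xi_n^2)$$

$$= n\sigma_y^2 - 2r\frac{\sigma_y}{\sigma_x}(nr\sigma_x\sigma_y) + r^2\frac{\sigma_y^2}{\sigma_x^2}(n\sigma_x^2)$$

$$= n\sigma_y^2(1 - r^2).$$

Since the sum of a number of squared quantities cannot be negative,
it follows that $r^2$ cannot exceed 1 and hence r lies between −1
and +1.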

Further, $n\sigma_y^2(1 - r^2)$ can only vanish if every one of the squared
quantities on the other side vanishes independently of the rest,
so that we only get r = ±1 when

$$\eta_1/\xi_1 = \eta_2/\xi_2 = \dots = \eta_n/\xi_n.$$

In this case the deviation of the one character from its mean is
always exactly proportional to the deviation of the other character
from its mean, and the correlation is then said to be perfect, for
it is equivalent to causation. In perfect correlation a one-to-one
correspondence thus exists between the values of the two
characters, for to one value of either there corresponds one and only
one value of the other and the standard deviation of the array
(measuring its variability) corresponding to any selected type
vanishes.
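The identity on which the limits of r depend, namely that the sum of the squared differences from the regression line equals $n\sigma_y^2(1 - r^2)$, can be checked on any set of numbers. The sketch below does so for five invented pairs and for a set lying exactly on a straight line; the function name `residual_identity` is our own.

```python
from math import sqrt


def residual_identity(xs, ys):
    """Return (sum of squared differences from the line, n*sigma_y^2*(1-r^2), r)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    xi = [x - mean_x for x in xs]
    eta = [y - mean_y for y in ys]
    var_x = sum(a * a for a in xi) / n
    var_y = sum(b * b for b in eta) / n
    p = sum(a * b for a, b in zip(xi, eta)) / n
    r = p / sqrt(var_x * var_y)
    slope = r * sqrt(var_y / var_x)          # r * sigma_y / sigma_x
    squared_differences = sum((b - slope * a) ** 2 for a, b in zip(xi, eta))
    return squared_differences, n * var_y * (1 - r * r), r


# Invented, imperfectly correlated observations.
lhs, rhs, r = residual_identity([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])

# Observations lying exactly on the line y = 2x + 1: perfect correlation.
lhs_perfect, rhs_perfect, r_perfect = residual_identity([1, 2, 3, 4], [3, 5, 7, 9])

print(lhs, rhs, r)
print(lhs_perfect, rhs_perfect, r_perfect)
```

In the second set every deviation of y is exactly twice the corresponding deviation of x, so the squared differences all vanish and r rises to unity.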

Zero correlation is at the opposite extreme where, no matter
what the type selected in the one character may be, the mean
value of the array in the second character is unaffected, because
the two characters are quite independent or uncorrelated; the
deviation of y from its mean bears no relation at all to the deviation
of x from its mean, and unit change in either is associated with no
particular change in the other, so that r must in this case be zero.

When r is negative, since $(y - \bar{y})/(x - \bar{x}) = r\sigma_y/\sigma_x$, and the $\sigma$'s are
necessarily positive, corresponding to any value of x above the
mean of all the x's the best value of $(y - \bar{y})$ is negative, that is, the
best value of y is below the mean of all the y's, and vice versa.
This means that in general high values of x would be associated
with low values of y, and vice versa.

If we take the mean as origin so that the regression lines become

$$y = r\,\frac{\sigma_y}{\sigma_x}\,x, \qquad x = r\,\frac{\sigma_x}{\sigma_y}\,y,$$

these lines coincide with the axes when the correlation is zero,
and with one another when r = ±1 and the correlation is perfect,
fig. (22). Given two equally variable characters ($\sigma_x = \sigma_y$) and
perfect correlation, the regression lines coincide with one of
the bisectors of the angle formed by the axes.

Fig. (22).

It may be helpful to look back again now at the graphical view
of the argument leading up to the determination of the coefficient
of correlation. For successive values of x we calculated the means
of the several y's observed, these being presumably the best
available y's corresponding to the particular x's selected, and we
assumed that, when plotted, the points so obtained, $(x_1, \bar{y}_1)$,
$(x_2, \bar{y}_2)$, $(x_3, \bar{y}_3)$, . . . lay roughly in a straight line. In the
same way we calculated the means of the several x's observed to
correspond to particular y's selected, and again we assumed that
the resulting points, $(\bar{x}_1, y_1)$, $(\bar{x}_2, y_2)$, $(\bar{x}_3, y_3)$, . . . lay roughly
in a straight line. These assumptions are justified in very many
cases, but when they fail recourse must be had to other methods
beyond the scope of this book. [See, for example, Pearson's paper
in Drapers' Company Research Memoirs, Biometric Series ii., On the
Theory of Skew Correlation and Non-linear Regression, introducing
the correlation ratio, $\eta$, which is equal to r in the particular case
when the regression is linear.]
Sometimes, again, although the observations are so scattered that
the assumption of a straight line to describe the best fit seems
somewhat wide of the mark, it may be justified on the ground that
no better graphical result would be given by using any other curve
in place of the line. Moreover the linear expression, y = mx + c,
is simple and may serve to give at all events the first two terms of
some more complex relation supplying an estimate for the most
probable y corresponding to a given x.

If we had plotted all the original pairs of observations, instead
of plotting certain x's and the mean y's associated with them, or
certain y's and the associated mean x's, the two lines of regression
would not have stood out so clearly: they would have lacked
definition, like an optical image which is not strictly in focus, but
there would have been a concentration of observations, as of light,
in the neighbourhood where the lines of regression intersect, namely
at $(\bar{x}, \bar{y})$, the mean of all the x's and all the y's. When, however,
the lines of regression lie close together they become more clearly
defined, all the observations being centred then more nearly in one
line, and the correlation tends towards perfection. Such cases are
frequent in Physics but rare, if found at all, in that class of
Statistics into which the element of human impulse enters. When
r is less than 1 the lines of regression, if the regression is of linear
type, will be inclined to one another at some angle between 0 and
90 degrees.

If only a rough value of r, the correlation coefficient, is required,
that may be obtained by merely estimating the gradient of each
regression line, one measured relative to the axis of x and the
other relative to the axis of y, multiplying the results together,
and taking the square root, for this product

$$= (\text{regression of } y \text{ on } x)(\text{regression of } x \text{ on } y) = r\frac{\sigma_y}{\sigma_x} \cdot r\frac{\sigma_x}{\sigma_y} = r^2.$$

Such an estimate may also be useful, though it may not be very
dependable, when the complete distribution of characters is not
known, for either regression line can be drawn when any two points
on it are known and a single array of values of either character
corresponding to a given type of the other is sufficient to fix one
such point; also the mean $(\bar{x}, \bar{y})$, if it were known, would at once
give a point common to both regression lines. When all the facts
are available, however, the method of calculation is to be preferred
to that of simply graphing the observations and their means, as there
is bound to be a certain amount of guesswork and consequent error
in deciding from a graph how the best regression lines run.
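
Since the two gradients are $r\sigma_y/\sigma_x$ and $r\sigma_x/\sigma_y$, their product is $r^2$, and r takes the sign common to both. The short Python sketch below, which is not part of the original text, shows the rough estimate; the function name `rough_r` and the two gradients supplied to it are hypothetical values such as might be read from a graph.

```python
import math

def rough_r(gradient_y_on_x, gradient_x_on_y):
    """Estimate r from the gradients of the two regression lines, the
    first measured relative to the axis of x, the second relative to
    the axis of y. Their product is r squared."""
    product = gradient_y_on_x * gradient_x_on_y
    if product < 0:
        # The two gradients always share the sign of r.
        raise ValueError("the two gradients must have the same sign")
    sign = -1.0 if gradient_y_on_x < 0 else 1.0
    return sign * math.sqrt(product)

# Hypothetical gradients estimated by eye from a graph.
print(rough_r(0.52, 0.35))
```

The estimate is only as good as the two gradients, which is why the method of calculation is to be preferred when all the facts are available.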

Fig. (23).

It is frequently convenient to refer the deviations of the given
variables to some point other than the mean $(\bar{x}, \bar{y})$ as origin, and,
when this is done, a correction must be applied to the resulting
value of r. We have already explained how, in such a case,
to correct for standard deviations, and, as $r = p/\sigma_x\sigma_y$, it only
remains to explain how to correct for p.

Now p is given by

$$np = \xi_1\eta_1 + \xi_2\eta_2 + \dots + \xi_n\eta_n,$$


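where the $\xi$'s and $\eta$'s denote deviations from $\bar{x}$ and $\bar{y}$ respectively.
Fig. (23) indicates the changes necessary in transferring from some
origin O to the mean G. The co-ordinates of P (representing a
typical observation) referred to O are (x, y) and referred to G are
($\xi$, $\eta$). Also the point G itself referred to O is $(\bar{x}, \bar{y})$. Thus

$$\xi = x - \bar{x}, \qquad \eta = y - \bar{y},$$

and np becomes

$$\begin{aligned}
&(x_1 - \bar{x})(y_1 - \bar{y}) + \dots + (x_n - \bar{x})(y_n - \bar{y}) \\
&= (x_1y_1 - \bar{x}y_1 - \bar{y}x_1 + \bar{x}\bar{y}) + \dots + (x_ny_n - \bar{x}y_n - \bar{y}x_n + \bar{x}\bar{y}) \\
&= (x_1y_1 + \dots + x_ny_n) - \bar{x}(y_1 + \dots + y_n) - \bar{y}(x_1 + \dots + x_n) + n\bar{x}\bar{y} \\
&= (x_1y_1 + \dots + x_ny_n) - \bar{x} \cdot n\bar{y} - \bar{y} \cdot n\bar{x} + n\bar{x}\bar{y} \\
&= \Sigma(xy) - n\bar{x}\bar{y},
\end{aligned}$$

where $\Sigma(xy)$ denotes the sum of expressions of the type xy.

Hence the corrected value of p

$$= \Sigma(xy)/n - \bar{x}\bar{y},$$

from which we infer that the corrected value of r is

$$r = \frac{\Sigma(xy) - n\bar{x}\bar{y}}{\sqrt{(\Sigma x^2 - n\bar{x}^2)(\Sigma y^2 - n\bar{y}^2)}}.$$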
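
As a check on the corrected formula, the following Python sketch (not part of the original text) evaluates it for the same observations referred to two different origins; the function name `corrected_r` and the observations themselves are hypothetical.

```python
import math

def corrected_r(xs, ys):
    """Correlation coefficient from sums of x, y, x^2, y^2 and xy taken
    about whatever origin the observations happen to be referred to."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    numerator = sum_xy - n * mean_x * mean_y
    denominator = math.sqrt((sum_xx - n * mean_x ** 2) *
                            (sum_yy - n * mean_y ** 2))
    return numerator / denominator

# Hypothetical observations, referred first to (0, 0) and then to (3, 4).
xs = [2.0, 4.0, 5.0, 7.0, 9.0]
ys = [3.0, 3.5, 6.0, 6.5, 9.0]
r_first = corrected_r(xs, ys)
r_shifted = corrected_r([x - 3 for x in xs], [y - 4 for y in ys])
print(r_first, r_shifted)
```

The two values printed should agree, the correction having removed the effect of the choice of origin.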

We proceed to a few applications of these results in the next 
chapter. 

[As early as 1846 a French physicist, Auguste Bravais, had conceived the
surface of error as a means of describing in space the path of a point whose x
and y co-ordinates are subject to errors which are not independent; but it
appears to be doubtful whether he saw the connection between his work and
the subject of correlation. It was Galton, nearly forty years later, who
really created that subject, introducing the coefficient of correlation on
graphical lines and giving practical examples of its use. (See Biometrika,
vol. xiii., pp. 25-45, Notes on the History of Correlation.)

Edgeworth, in 1892, using Galton's function, independently reached some
of Bravais' results related to the correlation of three variables, and showed
how they could be extended. Karl Pearson, in 1896, contributed to the
Royal Society Transactions a fundamental paper on the subject, with special
reference to the problem of heredity, drawing attention to the best value of
the correlation coefficient, and how it should be calculated. (See Appendix,
Note 11.) Yule, returning in the following year to Bravais' formulæ, showed
their significance also in the case of skew correlation.

Pearson afterwards developed a method of determining the correlation of
characters not quantitatively measurable, and in a discussion of the general
theory of skew correlation in another paper he proposed a new function, the
correlation ratio, applicable to the case of non-linear regression.]

CORRELATION—EXAMPLES


Example (1).—To find the correlation between Differences in Wholesale
Price Index Numbers and in the Marriage Rate from their corresponding
Nine-yearly Averages during the twenty years, 1889-1908,
using the data given on p. 77.


Table (26). Correlation between Differences in Wholesale
Prices and Marriage Rate from their respective Nine-
yearly Averages.

  (1)        (2)            (3)           (4)            (5)           (6)
 Year.   Difference in   Square of   Difference in    Square of   Product of Nos.
         Prices from     No. in      Marriage-rate    No. in      in Col. (2) and
         9-yearly        Col. (2).   from 9-yearly    Col. (4).   Col. (4).
         Average.                    Average.
           (x)            (x^2)          (y)           (y^2)          (xy)

 1889      + 0.9           0.81          + 1              1           +  0.9
 1890      + 2.3           5.29          + 6             36           + 13.8
 1891      + 7.0          49.00          + 6             36           + 42.0
 1892      + 2.4           5.76          + 3              9           +  7.2
 1893      + 2.0           4.00          - 6             36           - 12.0
 1894      - 2.8           7.84          - 5             25           + 14.0
 1895      - 4.3          18.49          - 6             36           + 25.8
 1896      - 6.1          37.21          + 1              1           -  6.1
 1897      - 3.7          13.69          + 3              9           - 11.1
 1898      - 0.2           0.04          + 4             16           -  0.8
 1899      - 1.6           2.56          + 6             36           -  9.6
 1900      + 5.3          28.09          + 1              1           +  5.3
 1901      + 1.0           1.00           ..              0             ..
 1902      - 0.5           0.25          + 1              1           -  0.5
 1903      - 1.4           1.96          - 1              1           +  1.4
 1904      - 1.3           1.69          - 3              9           +  3.9
 1905      - 2.4           5.76          - 2              4           +  4.8
 1906      - 0.5           0.25          + 3              9           -  1.5
 1907      + 3.2          10.24          + 6             36           + 19.2
 1908      - 1.8           3.24          - 2              4           +  3.6

 Totals  +24.1 -26.6     197.17        +41 -25          306         +141.9 -41.6

The arithmetic is comparatively simple in this case because
there is only one value of each variable corresponding to each year,
so that there is no weighting or grouping to complicate the analysis.
The variables x and y, between which we wish to find the correlation,
appear in col. (2) and col. (4) in Table (26), and the positive and
negative differences are separated from one another in each case
so as to make their summation easier.

Thus for the arithmetic mean of the numbers in col. (2), we have

$$\bar{x} = (+24.1 - 26.6)/20 = -0.125;$$

and for the mean of the numbers in col. (4), we have

$$\bar{y} = (+41 - 25)/20 = +0.8.$$

The straightforward procedure would now be to get the twenty
corresponding values of $\xi$ and $\eta$, the deviations of the twenty x's
in col. (2) and of the twenty y's in col. (4) from $\bar{x}$ and $\bar{y}$ respectively,
and, having found $\sigma_x$ and $\sigma_y$, we could immediately deduce r from
the formula

$$r = p/\sigma_x\sigma_y = (\xi_1\eta_1 + \xi_2\eta_2 + \dots + \xi_{20}\eta_{20})/20\sigma_x\sigma_y.$$

But it is simpler to measure the deviations from (0, 0) as origin
rather than from the mean (-0.125, +0.8), because $x^2$, $y^2$, and xy
involve fewer significant figures than would $\xi^2$, $\eta^2$, and $\xi\eta$, and,
of course, it will be necessary to correct for this at the end in the
usual way.

The mean square deviation of x referred to zero as origin

= 197.17/20, by col. (3).

Therefore, $\sigma_x^2 = 197.17/20 - (0.125)^2 = 9.843$,

$\sigma_x = 3.14$.

Again, the mean square deviation of y referred to zero as origin

= 306/20, by col. (5).

Therefore, $\sigma_y^2 = 306/20 - (0.8)^2 = 14.66$,

$\sigma_y = 3.83$.

Also the corrected p

$= \Sigma(xy)/n - \bar{x}\bar{y}$
= 100.3/20 - (-0.125)(+0.8), by col. (6)
= 5.015 + 0.100
= 5.115.

Hence r = 5.115/(3.14)(3.83)

= 0.43.

It is necessary to be careful with the signs in forming the numbers
in col. (6), but otherwise the actual calculation should present no
difficulty.
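
The working of Example (1) can be repeated mechanically. The Python sketch below is not part of the original text; it takes the twenty pairs of cols. (2) and (4) of Table (26), measures them from (0, 0) as origin, and applies the corrections at the end in the manner described above. The variable names are illustrative only.

```python
import math

# Cols. (2) and (4) of Table (26), for the years 1889 to 1908 in order.
x = [0.9, 2.3, 7.0, 2.4, 2.0, -2.8, -4.3, -6.1, -3.7, -0.2,
     -1.6, 5.3, 1.0, -0.5, -1.4, -1.3, -2.4, -0.5, 3.2, -1.8]
y = [1, 6, 6, 3, -6, -5, -6, 1, 3, 4,
     6, 1, 0, 1, -1, -3, -2, 3, 6, -2]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
# Mean squares and mean product about (0, 0), each corrected to the mean.
var_x = sum(v * v for v in x) / n - mean_x ** 2
var_y = sum(v * v for v in y) / n - mean_y ** 2
p = sum(a * b for a, b in zip(x, y)) / n - mean_x * mean_y
r = p / math.sqrt(var_x * var_y)

print(round(mean_x, 3), round(mean_y, 1), round(p, 3), round(r, 2))
```

Carried to two places, the value of r so found agrees with that obtained by hand.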

The regression equation giving the best marriage rate difference,
Y, for a given wholesale price difference, X, from their respective
nine-yearly averages is

$$(Y - 0.8) = r\,\frac{\sigma_y}{\sigma_x}\,(X + 0.125) = (0.43)\,\frac{3.83}{3.14}\,(X + 0.125),$$

i.e. Y = 0.52X + 0.86.

The regression equation giving the best wholesale price difference,
X, for a given marriage rate difference, Y, from their respective
nine-yearly averages is

$$(X + 0.125) = r\,\frac{\sigma_x}{\sigma_y}\,(Y - 0.8) = 0.35\,(Y - 0.8),$$

i.e. X = 0.35Y - 0.40.
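
The two regression equations follow at once from the means, the standard deviations, and r. The Python sketch below, not part of the original text, shows the step; the function name `regression_lines` is hypothetical, and because the hand working rounds the gradient before finding the constant term, the last figure of a constant may differ by a unit from that given above.

```python
def regression_lines(mean_x, mean_y, sigma_x, sigma_y, r):
    """Return (gradient, constant) for Y = gradient * X + constant, and
    likewise for X in terms of Y, both lines passing through the mean."""
    gradient_y = r * sigma_y / sigma_x
    gradient_x = r * sigma_x / sigma_y
    constant_y = mean_y - gradient_y * mean_x
    constant_x = mean_x - gradient_x * mean_y
    return (gradient_y, constant_y), (gradient_x, constant_x)

# Rounded values found in Example (1).
y_on_x, x_on_y = regression_lines(-0.125, 0.8, 3.14, 3.83, 0.43)
print(y_on_x)
print(x_on_y)
```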

We noted that fig. (10), p. 80, suggested a closer correlation 
between the two factors we have been considering during the 
earlier years of the period 1875-1908 than during the later years. 
It might be worth while as an exercise to see if this is borne out 
by calculating r for the years 1876-1889, and comparing it with 
the value found for the years 1889-1908. 

Example (2).—To find the correlation between Overcrowding and
Infant Mortality in London Districts. [Data taken from London
Statistics, vol. 23, published by the London County Council.]

The figures are apparently based upon the Census Report of 
1911. The numbers in col. (2), Table (27), show what percentage of 
the total population occupying private houses in each district were 
living in overcrowded conditions, any ordinary tenement which 
has more than two occupants to a room, including bedrooms and 
sitting-rooms, being defined as overcrowded. The numbers in 
col. (5) show the infantile mortality in each district, that is, the 
number of infants who died under one year out of every 1000
born, including both sexes.

For the sake of comparison these numbers have been plotted
together on the same graph sheet. The districts, arranged in
alphabetical order, were numbered from 1 to 29 so as to form a
horizontal scale corresponding to the scale of years in discussing prices
and marriages. The scale in this case is, of course, purely artificial,
and the only reason for joining up neighbouring points is that we are
better able by so doing to see whether or not high values of the one
variable go with high values of the other variable, and low with low.

In calculating the mean and standard deviation for overcrowding
we have measured deviations from 17.0 as origin, and in making the
same calculations for infant mortality we have measured deviations
from 125 as origin. It is convenient, therefore, to use the
point (17.0, 125) as origin in working out also the product deviation
sum, col. (8) of Table (27), instead of using the mean (17.86, 126.03).


Table (27). Correlation between Overcrowding and
Infant Mortality in London Districts (1911).

 Col. (1) District.
 Col. (2) Percentage of Population Overcrowded.
 Col. (3) Deviation of No. in Col. (2) from 17.0 (x).
 Col. (4) Square of No. in Col. (3) (x^2).
 Col. (5) Infant Mortality.
 Col. (6) Deviation of No. in Col. (5) from 125 (y).
 Col. (7) Square of No. in Col. (6) (y^2).
 Col. (8) Product of Nos. in Col. (3) and Col. (6) (xy).

       (1)                 (2)      (3)       (4)      (5)    (6)     (7)       (8)

 (1) Battersea            13.3    -  3.7     13.69     124   -  1       1    +   3.7
 (2) Bermondsey           23.4    +  6.4     40.96     156   + 31     961    + 198.4
 (3) Bethnal Green        33.2    + 16.2    262.44     151   + 26     676    + 421.2
 (4) Camberwell           13.5    -  3.5     12.25     109   - 16     256    +  56.0
 (5) Chelsea              14.9    -  2.1      4.41     109   - 16     256    +  33.6
 (6) City of London       12.3    -  4.7     22.09     124   -  1       1    +   4.7
 (7) Deptford             12.2    -  4.8     23.04     142   + 17     289    -  81.6
 (8) Finsbury             39.8    + 22.8    519.84     156   + 31     961    + 706.8
 (9) Fulham               14.6    -  2.4      5.76     125     ..      ..        ..
(10) Greenwich            12.1    -  4.9     24.01     128   +  3       9    -  14.7
(11) Hackney              12.4    -  4.6     21.16     119   -  6      36    +  27.6
(12) Hammersmith          14.2    -  2.8      7.84     146   + 21     441    -  58.8
(13) Hampstead             7.1    -  9.9     98.01      78   - 47    2209    + 465.3
(14) Holborn              25.6    +  8.6     73.96     115   - 10     100    -  86.0
(15) Islington            20.0    +  3.0      9.00     127   +  2       4    +   6.0
(16) Kensington           17.1    +  0.1      0.01     133   +  8      64    +   0.8
(17) Lambeth              13.6    -  3.4     11.56     123   -  2       4    +   6.8
(18) Lewisham              3.9    - 13.1    171.61     104   - 21     441    + 275.1
(19) Paddington           16.2    -  0.8      0.64     127   +  2       4    -   1.6
(20) Poplar               20.6    +  3.6     12.96     157   + 32    1024    + 115.2
(21) St. Marylebone       20.7    +  3.7     13.69     108   - 17     289    -  62.9
(22) St. Pancras          25.5    +  8.5     72.25     112   - 13     169    - 110.5
(23) Shoreditch           36.6    + 19.6    384.16     170   + 45    2025    + 882.0
(24) Southwark            25.8    +  8.8     77.44     144   + 19     361    + 167.2
(25) Stepney              35.0    + 18.0    324.00     144   + 19     361    + 342.0
(26) Stoke Newington       8.8    -  8.2     67.24     102   - 23     529    + 188.6
(27) Wandsworth            6.3    - 10.7    114.49     122   -  3       9    +  32.1
(28) Westminster          12.9    -  4.1     16.81     103   - 22     484    +  90.2
(29) Woolwich              6.3    - 10.7    114.49      97   - 28     784    + 299.6

     Totals                   +119.3 -94.4  2519.81       +256 -226  12748  +4322.9 -416.1

For overcrowding,

mean = 17 + 24.9/29 = 17.86;

$\sigma_x = \sqrt{(2519.81/29) - (0.86)^2} = \sqrt{86.15} = 9.3$.

For infant mortality,

mean = 125 + 30/29 = 126.03;

$\sigma_y = \sqrt{(12748/29) - (1.03)^2} = \sqrt{438.5} = 20.9$.

Also p, referred to (17.0, 125) = (4322.9 - 416.1)/29 = 3907/29, and,
referred to the mean (17.86, 126.03), this becomes

= 3907/29 - (0.86)(1.03)

= 133.8.

Hence r = 133.8/(9.3)(20.9) = 0.69,

so that the correlation between overcrowding and infant mortality
is fairly marked.
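
The result does not depend upon the working origin (17.0, 125). As a check, the Python sketch below, which is not part of the original text, works directly from the percentage overcrowded and the infant mortality of the twenty-nine districts, taken in the order of Table (27), with deviations measured from the means themselves. The variable names are illustrative only.

```python
import math

# Cols. (2) and (5) of Table (27), districts (1) to (29) in order.
overcrowding = [13.3, 23.4, 33.2, 13.5, 14.9, 12.3, 12.2, 39.8, 14.6, 12.1,
                12.4, 14.2, 7.1, 25.6, 20.0, 17.1, 13.6, 3.9, 16.2, 20.6,
                20.7, 25.5, 36.6, 25.8, 35.0, 8.8, 6.3, 12.9, 6.3]
mortality = [124, 156, 151, 109, 109, 124, 142, 156, 125, 128,
             119, 146, 78, 115, 127, 133, 123, 104, 127, 157,
             108, 112, 170, 144, 144, 102, 122, 103, 97]

n = len(overcrowding)
mean_x = sum(overcrowding) / n
mean_y = sum(mortality) / n
# Deviations measured from the means directly.
dx = [v - mean_x for v in overcrowding]
dy = [v - mean_y for v in mortality]
sigma_x = math.sqrt(sum(v * v for v in dx) / n)
sigma_y = math.sqrt(sum(v * v for v in dy) / n)
p = sum(a * b for a, b in zip(dx, dy)) / n
r = p / (sigma_x * sigma_y)

print(round(mean_x, 2), round(mean_y, 2), round(r, 2))
```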



Fig. (24). [Horizontal scale: numbers representing various London districts.]


The regression equation giving the average infant mortality, Y,
for districts in which the extent of overcrowding, X, is known is

$$Y - 126.03 = r\,\frac{\sigma_y}{\sigma_x}\,(X - 17.86) = (0.69)\,\frac{20.9}{9.3}\,(X - 17.86),$$

i.e. Y = 1.55X + 98.4.

Similarly, the regression equation giving the average percentage
of overcrowding, X, for districts with a known amount of infant
mortality, Y, is

$$X - 17.86 = r\,\frac{\sigma_x}{\sigma_y}\,(Y - 126.03) = 0.31\,(Y - 126.03),$$

i.e. X = 0.31Y - 21.

Example (3).—The reader might apply the same method to the
determination of the correlation between Ratio of Indoor Paupers
and Ratio of Outdoor Paupers, each measured per 1000 of the
estimated Population in England and Wales, excluding casuals and
insane, during the years 1900-1914. The following are the statistics
required for the purpose:

Table (28). Correlation between Ratio of Indoor and Ratio
of Outdoor Paupers, each measured per 1000 of the
Population.

 Year.   Indoor      Outdoor       Year.   Indoor      Outdoor
         Paupers:    Paupers:              Paupers:    Paupers:
         Rate per    Rate per              Rate per    Rate per
         1000.       1000.                 1000.       1000.

 1900      6.9        16.8         1908      6.8        15.4
 1901      5.8        16.3         1909      7.1        16.6
 1902      6.0        15.3         1910      7.2        15.1
 1903      6.2        15.4         1911      7.2        14.1
 1904      6.3        16.4         1912      6.9        11.2
 1905      6.6        16.1         1913      6.7        11.1
 1906      6.8        16.0         1914      6.4        10.4
 1907      6.8        15.6

The coefficient of correlation in this case comes out negative
and = -0.15, but it is very small and probably not significant.
If it were, it would imply that as indoor pauperism diminishes 
outdoor pauperism increases, and vice versa. 


Example (4).—To find the correlation between the Number of
Cattle and the Number of Acres of Permanent Grass-land in the Coal-
Producing Counties of England (1915).

A Government Report was consulted giving the acreage under
crops and grass and the number of live stock in each petty sessional
division in the country, as returned on 4th June 1915, and the
counties included were those which appear in the coal-mining
reports published monthly in the Labour Gazette.

In each county the petty sessional divisions with the greatest
and the least numbers of cattle and of acres of grass-land were
noted, the numbers being written down to the nearest 1000, and,
after a rough examination of the range of these variables from
county to county, suitable class intervals were chosen and a table
of double entry was drawn up, Table (29), with an empty square
ready for each possible pair of variables.












Table (29). Correlation between the Number of Cattle
and the Number of Acres of Permanent Grass-land in
the Coal-Producing Counties of England (1915).

[Table of double entry. Columns: Total Head of Cattle, x (expressed to
nearest thousand), in classes 0-5, 5-10, 10-15, 15-20, 20-25, 25-30, 30-35,
35-40, followed by a column giving the mean x of each row. Rows: Total
Number of Acres of Permanent Grass-land, y (expressed to nearest thousand),
in classes 0-5, 5-10, and so on up to 80-85, followed by a row of totals
and a row giving the mean y of each column.]

Each petty sessional division was then considered in turn and a
dot was inserted in the particular square applicable to it: e.g. a
petty sessional division with 42,000 acres of grass-land and feeding
19,000 cattle would be represented by a dot in the square defined
by row (40-45) and col. (15-20) in Table (29); x was used to
represent the number of cattle and y the number of acres of grass-land
in any division, each expressed to the nearest 1000 units. All the
dots were ultimately added in each square giving the frequency
for each corresponding pair of variables, and these frequencies were
recorded in the centres of the squares to which they applied: e.g.
the frequency of petty sessional divisions stocking 10 to 15 thousand
cattle and with 30 to 35 thousand acres under permanent grass
was 22. The total frequency for each row, i.e. each array of
selected y type, was also noted, in the column at the end of the
rows: e.g. altogether 31 petty sessional divisions were observed of
the type having 30 to 35 thousand acres of land under permanent
grass. Likewise the total frequency for each column, i.e. each
array of selected x type, was noted in the row at the foot of the
columns: e.g. altogether 54 divisions were observed of the type
stocking 10 to 15 thousand head of cattle.

It was possible now to treat each column separately and to
calculate the mean y's associated with different types of x, namely
$\bar{y}_1$, $\bar{y}_2$, $\bar{y}_3$, . . .: the means so obtained were inserted in
the bottom row of Table (29): e.g. when x lies between 20 and 25
thousand, the mean value of y is 60 thousand. The resulting
points, $(x_1, \bar{y}_1)$, $(x_2, \bar{y}_2)$, $(x_3, \bar{y}_3)$, . . . in the notation of Chapter x.,
are plotted together in fig. (25), and they are seen to lie
approximately in a straight line. The successive rows were treated in
precisely the same way and the mean x's calculated corresponding
to y's of different types, namely $\bar{x}_1$, $\bar{x}_2$, $\bar{x}_3$, . . ., the means
obtained being recorded in the extreme right-hand column of
Table (29): e.g. when y lies between 45 and 50 thousand, the mean
value of x is 19 thousand. The resulting points, $(\bar{x}_1, y_1)$, $(\bar{x}_2, y_2)$,
$(\bar{x}_3, y_3)$, . . ., are also plotted in fig. (25), and, excepting for values
which depend upon only one or two records, they too lie roughly
in a straight line which is not far from coinciding with the previous
one, so that we shall expect on calculation to get a high value for
the coefficient of correlation.

In order to calculate r we need first to find the mean and standard
deviation for each variable. For this let us take as origin the
point (12.5, 27.5). The essential details are shown immediately
below the relative Tables (30) and (31).

Table (30). Distribution of Petty Sessional Divisions
according to the Head of Cattle (expressed to nearest
1000) stocked.

      (1)             (2)            (3)             (4)              (5)
 No. of Cattle     Deviation     No. of Petty     Product of       Product of
 stocked (in       from 12.5     Sessional        Nos. in          Nos. in
 thousands).       (x), in       Divisions.       Cols. (2)&(3).   Cols. (2)&(4).
                   class units.

     0-5              - 2            76             - 152             304
     5-10             - 1            97             -  97              97
    10-15               0            54                ..              ..
    15-20             + 1            24             +  24              24
    20-25             + 2            14             +  28              56
    25-30             + 3             5             +  15              45
    30-35             + 4             5             +  20              80
    35-40             + 5             1             +   5              25

    Totals                          276             - 157             631

Mean number of cattle $= 12.5 - \frac{157}{276} \times 5 = 9.66$, since $\bar{x} = -\frac{157}{276}$ class
units referred to 12.5 as origin; and

$$\sigma_x = 5\sqrt{\frac{631}{276} - \left(\frac{157}{276}\right)^2} = 5\sqrt{1.963} = 7.00.$$
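
The same working in class units serves for any grouped distribution. The Python sketch below, not part of the original text, repeats it for the head of cattle; the function name `grouped_mean_sd` is hypothetical, while the deviations and frequencies are those of cols. (2) and (3) of Table (30).

```python
import math

def grouped_mean_sd(deviations, frequencies, origin, class_width):
    """Mean and standard deviation of a grouped distribution, with the
    deviations measured from `origin` in units of `class_width`."""
    n = sum(frequencies)
    first_moment = sum(d * f for d, f in zip(deviations, frequencies))
    second_moment = sum(d * d * f for d, f in zip(deviations, frequencies))
    mean_units = first_moment / n
    sd_units = math.sqrt(second_moment / n - mean_units ** 2)
    # Convert from class units back to the original units.
    return origin + class_width * mean_units, class_width * sd_units

deviations = [-2, -1, 0, 1, 2, 3, 4, 5]           # col. (2)
frequencies = [76, 97, 54, 24, 14, 5, 5, 1]       # col. (3)
mean_cattle, sd_cattle = grouped_mean_sd(deviations, frequencies, 12.5, 5)
print(round(mean_cattle, 2), round(sd_cattle, 2))
```

The acreage under permanent grass may be treated in exactly the same way, with 27.5 as origin.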

[The numbers in col. (4) may be spoken of as the first moments
of the totals of x arrays and the numbers in col. (5) as the second
moments.]

In order to calculate easily the product deviation with reference
to (12.5, 27.5) as origin, the value proper to each square was inserted
just above the frequency and the product of the deviation by the
frequency was inserted just below the frequency in different type of
print to prevent confusion: e.g. the row (50-55) is +5 class intervals
distant from the row (25-30) containing the origin, and the column
(20-25) is +2 class intervals distant from the column (10-15)
containing the origin; hence, for the particular square defined by this
row and this column, the product deviation = 5 x 2 = 10; also
the frequency recorded in this square = 4, so that it supplies a
term 10 x 4 to the product deviation; the numbers 10, 4, and 40
are therefore the numbers which appear in the square. It is
necessary to be careful with the signs; if the product deviation is to
be positive, the separate deviations must be of like sign, both
positive or both negative: hence they must either be both above
or both below the numbers 12.5 and 27.5 respectively from which
they are measured. In this instance there are only two negative
terms among the product deviations in the whole table.


Table (31). Distribution of Petty Sessional Divisions
according to the Number of Acres of Land (expressed to
nearest 1000) under Permanent Grass.

      (1)             (2)            (3)             (4)              (5)
 No. of Acres      Deviation     No. of Petty     Product of       Product of
 under Grass       from 27.5     Sessional        Nos. in          Nos. in
 (in thousands).   (y), in       Divisions.       Cols. (2)&(3).   Cols. (2)&(4).
                   class units.

     0-5              -  5           15             -  75             375
     5-10             -  4           30             - 120             480
    10-15             -  3           48             - 144             432
    15-20             -  2           33             -  66             132
    20-25             -  1           30             -  30              30
    25-30                0           26                ..              ..
    30-35             +  1           31             +  31              31
    35-40             +  2           23             +  46              92
    40-45             +  3            8             +  24              72
    45-50             +  4           10             +  40             160
    50-55             +  5            9             +  45             225
    55-60             +  6            5             +  30             180
    60-65             +  7            1             +   7              49
    65-70             +  8            1             +   8              64
    70-75             +  9            1             +   9              81
    75-80             + 10            3             +  30             300
    80-85             + 11            2             +  22             242

    Totals                          276             - 143            2945

Mean number of acres $= 27.5 - \frac{143}{276} \times 5 = 24.91$, since $\bar{y} = -\frac{143}{276}$
class units; and

$$\sigma_y = 5\sqrt{\frac{2945}{276} - \left(\frac{143}{276}\right)^2} = 5\sqrt{10.402} = 16.12.$$

[The numbers in col. (4) are the first moments of the totals of y 
arrays, and the numbers in col. (5) are the second moments.] 

It is now a simple matter to sum the product deviation terms,
taking each column (or each row) in turn: e.g. the first column
gives

150 + 216 + 180 + 12 = 558;

the second column gives

12 + 54 + 60 + 25 - 6 - 2 = 143,

and so on; and, summing these results together, we get

558 + 143 + 76 + 126 + 96 + 160 + 30 = 1189.
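
The summation over the squares of a table of double entry can be sketched as follows in Python; this is not part of the original text. The function name `product_deviation_sum` is hypothetical, and of the four squares shown only the first is taken from the example (the square with deviations +2 and +5 and frequency 4); the other three are invented to illustrate the treatment of signs and of squares lying in the row or column of the origin.

```python
def product_deviation_sum(cells):
    """Sum of (x deviation) * (y deviation) * frequency over every
    occupied square, deviations being in class units from the origin."""
    return sum(dx * dy * f for (dx, dy), f in cells.items())

cells = {
    (2, 5): 4,     # quoted in the text: 5 x 2 = 10, and 10 x 4 = 40
    (-1, -2): 3,   # hypothetical: like signs, so the term is positive
    (1, -1): 2,    # hypothetical: unlike signs, so the term is negative
    (0, 3): 7,     # hypothetical: in the origin column, contributes nothing
}
total = product_deviation_sum(cells)
print(total)
```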










But this is the sum of all the product deviations referred to
(12.5, 27.5) as origin. Transferring now to the mean, we have

$$p = \Sigma(xy)/n - \bar{x}\bar{y} = \frac{1189}{276} - \left(-\frac{157}{276}\right)\left(-\frac{143}{276}\right) = 4.013, \text{ expressed in class units.}$$

Hence, $r = p/\sigma_x\sigma_y$,

where $\sigma_x$ and $\sigma_y$ are also to be expressed in class units,

$$= 4.013/\sqrt{1.963}\,\sqrt{10.402} = 0.89,$$

a result not far from unity, so that the correlation is high.
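
The fractions in this working can be carried exactly before any rounding is done. The Python sketch below, which is not part of the original text, does so with the standard `fractions` module, taking the totals of Tables (30) and (31) and the product deviation sum 1189; the variable names are illustrative only.

```python
from fractions import Fraction
import math

n = 276
first_x, second_x = -157, 631      # totals of cols. (4) and (5), Table (30)
first_y, second_y = -143, 2945     # totals of cols. (4) and (5), Table (31)
product_sum = 1189                 # referred to (12.5, 27.5), class units

# Exact values in class units, corrected to the mean.
p = Fraction(product_sum, n) - Fraction(first_x, n) * Fraction(first_y, n)
var_x = Fraction(second_x, n) - Fraction(first_x, n) ** 2
var_y = Fraction(second_y, n) - Fraction(first_y, n) ** 2

r = float(p) / math.sqrt(float(var_x) * float(var_y))
print(float(p), float(var_x), float(var_y), r)
```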

The regression of 'acreage of grass-land' (Y) on head of cattle
(X) is given by

$$(Y - 24.91) = r\,\frac{\sigma_y}{\sigma_x}\,(X - 9.66) = (0.89)\,\frac{16.12}{7.00}\,(X - 9.66),$$

i.e. Y = 2.05X + 5.11.

The points representing the mean y's for x's of different types
should lie close to this line which is shown in fig. (25). This equation
enables us to predict the acreage under permanent grass to be 
found on the average in petty sessional divisions with a given total 
head of cattle in each. The words ‘ on the average,’ to be tacitly 
understood even if not stated in all such cases, are emphasised 
because the prediction relates to the whole array of divisions of a 
particular type, and as it only professes to give the mean or most 
likely result it is not to be pronounced worthless if it fails in an 
individual trial with a selected division. 

Again, the regression of X on Y is given by

$$(X - 9.66) = r\,\frac{\sigma_x}{\sigma_y}\,(Y - 24.91),$$

i.e. X = 0.39Y + 0.05,

which tells us the total head of cattle (X) to be found on the average
in petty sessional divisions when the acreage under permanent
grass (Y) is known. This line is also drawn in fig. (25).

Example (5).—The data for this example are taken from an
exceedingly interesting Government Report on the Cost of Living
of the Working Classes (Report of an Inquiry by the Board of Trade
into Working Class Rents and Retail Prices together with the Rates
of Wages in certain Occupations in Industrial Towns of the United
Kingdom in 1912 in continuation of a similar Inquiry in 1905.
Cd. 6955). Some further particulars concerning this Report will
be found on p. 281.

Fig. (25). [Horizontal axis: Total Head of Cattle, X (expressed to nearest thousand).]

The towns included in the inquiry numbered . . ., but in five
instances it was found desirable to consider closely adjacent
municipalities as single towns, thus reducing the number of town-units
to 88, namely 72 in England, 10 in Scotland, and 6 in Ireland. In
the example which follows the three zones of London, middle,
inner, and outer, have been treated as separate towns, so making
the net number of town-units 90. This number is too small to
allow any real value to be attached to our results, but the fewness
of the observations makes them easier to deal with as an illustration
of method.

We begin as before by choosing convenient class intervals for
the two factors we propose to consider, namely, Increment of
Unskilled Wages and Increment of Rents (by increment in each case
is meant the percentage increase (+) or decrease (-) between
1905 and 1912), and then form a correlation table. In the last
example separate tables were drawn up to find means and S.D.'s,
but that was only done in order to keep the argument clear at its
first presentment: generally we may dispense with these additional
tables and show all the working in one (see Table (32)).

The increment of wages runs from (-2.5) per cent. to (+11.5)
per cent., so that, if we take (-0.5) as origin and a difference of
2 per cent. as unit, the classes run from (-1) to (+6), these numbers
being shown in different type in the table, but in the same
compartments as the others. In the fourth row from the bottom
are shown the total frequencies for x arrays from class (-1) to
class (+6), and in the row just below it these several frequencies
are shown multiplied by their corresponding deviations measured
from (-0.5) as origin in terms of the class unit: the resulting
numbers give the first moments of the totals of x arrays. These
numbers, multiplied again by their corresponding deviations, give
the second moments of the totals of x arrays, and appear in the
last row but one of the table.

We deal in exactly the same way with increment of rents; a
percentage increment of (-1) is taken as origin from which
deviations are measured, a difference of 3 per cent. is taken as unit,
and the different classes then have deviations running from (-3)
to (+6). The totals of y arrays, the first moments, and the
second moments of these totals appear in the last three columns
on the right-hand side of Table (32).

To calculate the deviation products, numbers were inserted in
each square on the same principle as in the last example, and the
sums of these products for each x array, that is for each column,
are given in the bottom row of the table: -1, 0, 14, 8, etc., making
in all a total of 126.

Table (32). Correlation between Increment of Unskilled
Wages and Increment of Rents in certain Industrial
Towns of the United Kingdom.

[Table of double entry: columns for the x classes (increment of wages),
rows for the y classes (increment of rents). The four rows at the foot of
the table are as follows.]

 Deviation of x class        -1     0    +1    +2    +3    +4    +5    +6   Total

 Totals of x arrays           2    45     8    12     7     7     8     1     90
 1st moments of x arrays     -2     0     8    24    21    28    40     6    125
 2nd moments of x arrays      2     0     8    48    63   112   200    36    469
 Product sums of x arrays    -1     0    14     8     9    52    50    -6    126

[The last three columns on the right give the totals of y arrays and their
first and second moments, which sum to 90, 75, and 305 respectively.]

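The necessary calculations are as follows:

1. Mean $x = -0.5 + 2(125)/90 = 2.28$,

   $\sigma_x = 2\sqrt{\frac{469}{90} - \left(\frac{125}{90}\right)^2} = 2\sqrt{26585}/90.$

2. Mean $y = -1 + 3(75)/90 = 1.50$,

   $\sigma_y = 3\sqrt{\frac{305}{90} - \left(\frac{75}{90}\right)^2} = 3\sqrt{21825}/90.$

3. $p = \frac{126}{90} - \left(\frac{125}{90}\right)\left(\frac{75}{90}\right) = \frac{1965}{(90)^2}$, expressed in class units.

Hence $r = p/\sigma_x\sigma_y$

$$= \frac{1965}{(90)^2} \times \frac{90}{\sqrt{26585}} \times \frac{90}{\sqrt{21825}} = 0.08.$$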

Fig. (26).

In substituting for $\sigma_x$ and $\sigma_y$ to find r we have omitted the factors
2 and 3 respectively, because the S.D.'s have to be expressed in
the same units as p. Alternatively, if we worked with a difference
of 1 per cent. as unit, instead of taking a difference of 2 per cent.
as unit for x deviations, and a difference of 3 per cent. as unit for
y deviations, each individual product of x and y deviations would
have to be multiplied by 2 x 3. Thus p would then be $6 \times 1965/(90)^2$,
and we should get the same result for r as before by taking $\sigma_x$
and $\sigma_y$ as in (1) and (2) above. In this case r is so small as to be
quite insignificant of any correlation between the two factors
discussed, and the regression lines should therefore be not far from
perpendicular to one another.

The regression of y on x, or the equation giving the most probable
y for a given type x, is

$$(y - 1.50) = r\,\frac{\sigma_y}{\sigma_x}\,(x - 2.28),$$

i.e. $y = 0.11x + 1.25$.

Similarly, the regression of x on y is

$$x = 0.06y + 2.2.$$

To draw the first line we note that it passes through the points
(0, 1.25) and (5, 1.8); also the second line goes through the points
(2.2, 0) and (2.5, 5). The two lines intersect at M (2.28, 1.5), the
mean of the distribution. They are drawn together in fig. (26).
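
The arithmetic of this example can be checked mechanically. The following is only a sketch: it assumes the marginal sums quoted for Table (32) (90 towns; first and second moments 125 and 469 for wages, 75 and 305 for rents; product sum 126), and the variable names are illustrative, not the book's.

```python
import math

# Summary sums taken from the margins of Table (32).
# All deviations are in class units measured from the working origins.
N = 90                      # number of towns
sum_x, sum_x2 = 125, 469    # first and second moments of the x-array totals
sum_y, sum_y2 = 75, 305     # first and second moments of the y-array totals
sum_xy = 126                # total of the deviation products

x_origin, x_unit = -0.5, 2  # wages: working origin and class unit (per cent.)
y_origin, y_unit = -1.0, 3  # rents: working origin and class unit (per cent.)

mean_x = x_origin + x_unit * sum_x / N
mean_y = y_origin + y_unit * sum_y / N

# S.D.'s and mean product in class units, referred to the means.
sd_x_units = math.sqrt(sum_x2 / N - (sum_x / N) ** 2)
sd_y_units = math.sqrt(sum_y2 / N - (sum_y / N) ** 2)
p_units = sum_xy / N - (sum_x / N) * (sum_y / N)

# r is a pure number, so class units may be used throughout.
r = p_units / (sd_x_units * sd_y_units)

# The regression coefficients need the S.D.'s in per cent.
sd_x, sd_y = x_unit * sd_x_units, y_unit * sd_y_units
slope_y_on_x = r * sd_y / sd_x
slope_x_on_y = r * sd_x / sd_y
intercept_y = mean_y - slope_y_on_x * mean_x
intercept_x = mean_x - slope_x_on_y * mean_y

print(f"mean x = {mean_x:.2f}, mean y = {mean_y:.2f}, r = {r:.2f}")
print(f"y = {slope_y_on_x:.2f} x + {intercept_y:.2f}")
print(f"x = {slope_x_on_y:.2f} y + {intercept_x:.2f}")
```

Working in class units for r and converting to per cent. only for the regression slopes mirrors the remark above about omitting the factors 2 and 3.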


Table (33). Correlation between Unskilled Wages 
AND Rents in certain Industrial Towns of the 
United Kingdom. 



Example (6).—Instead of discussing the Changes in Wages and
Rents between 1905 and 1912, it might be of interest to find the
correlation between index numbers representing Actual Wages and
Rents in October 1912, taken from the same Report. The necessary
data for this purpose appear in Table (33) showing the distribution
of frequency between the different classes: e.g. seven towns were
observed in which the index number for wages was between the
limits (79-84) and the index number for rents was between the
limits (63-60). The wages figures quoted in Table (33) refer only
to unskilled labour in the building trade; the inquiry actually
embraced certain occupations in the building, engineering, and
printing trades, these having been selected as industries which are
found in most industrial towns, and in which the time rates of
wages are largely standardised.

Table (34). Correlation between Increment of Working
Class Prices and Increment of Working Class Rents
in certain Industrial Towns of the United Kingdom.

[Table: x = Percentage Increment of Prices; y = Percentage Increment of Rents.]


The coefficient of correlation turns out to be 0.46, distinctly larger
than in the previous case. Also the lines of regression are:

(1) y = 0.47x + 21. (2) x = 0.45y + 66.

Example (7).—The Report also furnishes data for evaluating the
correlation between the Increment of Working Class Prices and
Increment of Working Class Rents, again meaning by increment the
percentage increase (+) or decrease (-) between 1905 and 1912
(see Table (34)).

The correlation in this case is very small, being only 0.13. The
regression equations are

(1) y»0*22x-l*fi. 


(2) xs0 07y-f It. 









PART II 


CHAPTER XII 

INTRODUCTION TO PROBABILITY AND SAMPLING 

Suppose we wish to know the average measurement of some organ
or character, e.g. length of forearm or weight or anything similar,
in a large population containing several thousand individuals. The
mean obtained by actual measurement, if it were practicable to
carry it out on so large a scale, would evidently depend to some
extent upon the sex, the race, the age, the social class, and so on, 
of the individuals selected, and we shall accordingly assume our 
population to be composed of individuals of the same race and sex, 
at about the same age, taken from the same class, etc. ; it would be 
impossible in practice no doubt to secure that all conditions should 
be identically the same for all the individuals observed, but the 
population may be as homogeneous as we care to make it in theory. 

Now suppose that, instead of attempting to measure every single
individual, a random sample of 1000 from among the population
be taken and that the mean and variability of the measurements
for this sample be calculated, giving results $m_1$ and $\sigma_1$. With
these may be compared $m_2$ and $\sigma_2$, the results of measuring a second
sample of 1000 individuals, $m_3$ and $\sigma_3$, the results of a third sample,
and so on. It is extremely unlikely that the values obtained for
the $m$'s in this way will equal one another, neither will the $\sigma$'s
be equal; but, if we have succeeded at the beginning in avoiding
all ill-balanced influences when we tried to make the field of
observation as homogeneous as possible, the resulting $m$'s and $\sigma$'s
will only differ from the values of the mean and variability for the
whole population, assuming they could be measured, within a
comparatively small range.

Differences of this kind, which arise merely owing to the fact
that we are often obliged in practice, for lack of time or means, to
deal with a comparatively small sample instead of with the whole
population of which it forms a part, are said to be due to random
sampling. Granted that the samples themselves are adequate in
size (containing, say, from 500 to 1000 individuals each) an
estimate of differences to be expected between one and another can be
made, and unless the observed differences fall outside recognized
limits it is said that they are not significant of any difference other
than such as might quite well be accounted for by random sampling
alone.

In theory, then, we can imagine a large number of such random
samples selected, and by determining the S.D. of their means,
$m_1, m_2, m_3, \ldots$, we should have a fair measure of the deviation
which might quite well occur from the true value, that is, from the
mean of the population as a whole, through working only with a
sample. Further, a range of two or three times the S.D. on either
side of the true mean ought to take in the majority of the sample
means observed.
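
The idea can be tried out by simulation. The sketch below is purely illustrative: the population (50,000 forearm lengths drawn from a normal curve with mean 45 cm and S.D. 2.5 cm), the number of samples (200), and all names are assumptions made for the purpose, not data from the text.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical homogeneous population: 50,000 forearm lengths in cm.
population = [random.gauss(45.0, 2.5) for _ in range(50_000)]
true_mean = statistics.fmean(population)

# Draw many random samples of 1000 and record the mean of each.
sample_means = [
    statistics.fmean(random.sample(population, 1000)) for _ in range(200)
]

# The S.D. of the sample means measures the deviation to be expected
# from the true mean through working only with a sample.
sd_of_means = statistics.pstdev(sample_means)


def share_within(k):
    """Proportion of sample means within k S.D.'s of the true mean."""
    inside = sum(abs(m - true_mean) <= k * sd_of_means for m in sample_means)
    return inside / len(sample_means)


print(f"true mean = {true_mean:.3f}, S.D. of sample means = {sd_of_means:.3f}")
print(f"within 2 S.D.: {share_within(2):.1%}, within 3 S.D.: {share_within(3):.1%}")
```

With samples of this size the sample means should cluster closely about the true mean, nearly all of them falling inside the range of three times their S.D.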

Exactly the same principle holds good in dealing with the
proportion of individuals in a given population which can be assigned
to a particular class, or in discussing the S.D. of the distribution, or
the C. of V., or a coefficient of correlation, or any other statistical
constant, no matter what the nature of the character may be which
is measured or observed, or whether it relates to animate or
inanimate objects. Take, for instance, the variability—by selecting
several samples from a given population we get a series of values
$\sigma_1, \sigma_2, \sigma_3, \ldots$, and in the S.D. of this distribution of variabilities
we have a measure to which we can compare the deviation of any
sample variability from the true variability of the whole
population, while a range two or three times the S.D. might be expected
to include the majority of the different variabilities met with in
the samples.

Although the S.D., as we have explained, provides quite a
suitable measure of the extent of deviation of a sample constant from
its true value in the population as a whole, in practice, owing to
the historical development of the theory having followed the track
of the normal curve of error [see Chapter xviii.], a measure known
as the probable error and equal roughly to two-thirds of the S.D.
is not seldom employed in its place. The main, if not the sole,
justification for retaining this measure is that it has established its
position by long usage, and in any case it is very easily deduced
from the S.D. by the relation

p.e. = 0.6745 S.D.,

which follows at once from the normal curve and is only strictly
justified when the distribution is normal (see p. 246). Let it suffice
here that instead of simply using the S.D., as might now seem
the obvious course, some writers prefer to multiply the S.D. by a
certain fraction, in which there is no particular virtue except that
which arises through honourable descent, and to work with the
‘probable error.’
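
The factor 0.6745 can be recovered numerically. For a normal distribution the probable error is the deviation which is as likely to be exceeded as not, so that half the area of the curve lies within one p.e. on either side of the mean. The sketch below uses `statistics.NormalDist` from the Python standard library; the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # normal curve with mean 0 and S.D. 1

# Half of all deviations are numerically smaller than the p.e., so the
# p.e. is the upper quartile of the curve, expressed in S.D. units.
pe_factor = std_normal.inv_cdf(0.75)

# Check: the area between -p.e. and +p.e. should be one half.
central_area = std_normal.cdf(pe_factor) - std_normal.cdf(-pe_factor)

print(f"p.e. / S.D. = {pe_factor:.4f}")
print(f"area within one p.e. of the mean = {central_area:.3f}")
```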

Since we do not know how much weight to assign to any result
unless the magnitude of its p.e. is also given, results are frequently
stated in the following manner: in a study of the Variation and
Correlation in the Earthworm, by R. Pearl and W. N. Fuller
[Biometrika, vol. iv. pp. 213-229]

Mean length of worm = 19.171 ± 0.094 cms.,

S.D. = 3.077 ± 0.067 cms.,

C. of V. = 16.049 ± 0.356 per cent.,

meaning that the mean length of the worms measured was 19.171
cms., subject to a probable error of 0.094 cms. which might be in
excess or defect, in other words the mean length lay probably
somewhere between

19.077 cms. and 19.265 cms.;

similar remarks apply to the variability, absolute (S.D.) or relative
(C. of V.).

When the standard deviation (p.e./0.6745) is used as the measure
of error due to simple sampling, the fact is generally recorded, and 
it is sometimes spoken of as the standard error in that connection, 
but, as it seems unnecessary to multiply names for ideas which are 
not really new, only that they appear in a new setting, we shall 
not employ the term. 

It must be clearly understood that no outstanding and
predictable cause exists, by our hypothesis, for such differences as occur
in the statistical constants between one sample and another; they
are the resultant effect of a complex of forces which cannot be
properly traced, still less measured, apart from one another, and
which have been happily described as that ‘mass of floating causes
generally known as chance.’ Since therefore the forces coming
into play, under the ideal conditions formulated, are of the same
chance nature as those affecting the spin of a well-balanced coin
or the selection of a card from a smooth and well-shuffled pack,
it may be expected that the resulting distribution of means,
$m_1, m_2, m_3, \ldots$, of S.D.'s, $\sigma_1, \sigma_2, \sigma_3, \ldots$, and of all the other
constants will likewise be subject to the same laws of probability
which serve to describe within limits what happens in the case of
coin or card. It follows that some acquaintance with the first
elements of mathematical probability is essential if one is to
understand the theory of sampling, and a short digression must here
be made in order to introduce that subject. This will be found
to lead directly to a solution, under certain prescribed conditions,
in the simple case when the character observed is an attribute like
complexion, fair or dark, or like birth, male or female, which can
only fall into one of two definite classes and when every one
observation in the sample is independent of every other. In the more
general case where the character observed is capable of direct
measurement and may lie in magnitude anywhere along a scale
of values divided up into a number of different classes, it is not
so easy to determine the effect of random sampling, because it is
not possible, as it is in the previous case, actually to draw up a
frequency table describing in detail the character of the
distribution to be expected from theory in any given sample.

The idea contained in the word probability is one familiar to us
in our everyday talk, but if we seek to analyse it as used we find
it as elusive as the personality of the user. A remarks: ‘Wars
will probably be stamped out, like duelling, in the course of time.’
B replies: ‘No! fighting will probably go on as long as the world
lasts—you can’t change human nature.’ Now the amount of
credence we are prepared to give to each of these statements is
vague and uncertain until we know something about A and B
themselves and the value of their judgment, quite apart from the
influence of our own opinion upon the matter; perhaps A is an
optimist or B is a pessimist, and in estimating the ‘probably’
used by each we must allow for these facts. Probability, then, in
ordinary conversation, is something largely subjective: it has a
varying significance according to the person who uses the word
and, unless we could get rid of this personal element, it would be
hopeless to try and approach it along scientific lines.

Mathematical probability is unlike colloquial probability in that
all the uncertainty is taken out of it, or at least the uncertainty is
confined within defined limits. We shall only touch the fringe of
the subject in this book, and what we have to say may be best
introduced by considering some examples which may appear trivial,
but they possess the merit that no personal bias can enter into
their discussion to distort the results. The reader must not be
impatient at their artificial character: in many, if not in all,
branches of science, before tackling any particular problem as it
actually exists, it is helpful to examine what can be deduced in a
simple case free from all complication, and, having settled that,
we try to see how the results are affected when we come to allow
one by one for the various complicating factors which exist. For
example, in Astronomy, the track of a planet in space may first be
found on the hypothesis that the sun alone is the compelling influence.
Then we may proceed to discuss how it is deflected from its path
when the gravitational influence of neighbouring planets also is
taken into account.

Let us start with an ordinary pack of playing cards, and, after
shuffling, turn up one card. Can we measure the probability that
this card shall be (1) the 7 of spades? (2) some spade?

Altogether there are 52 cards, and we will suppose that the
cards are so cut and so smooth that each of the 52 has an equal
chance of being turned up: for instance, there is to be no
stickiness or anything to help any particular card to evade us by sticking
fast to its neighbour. Now we are certain to turn up some card
and there are 52 different possibilities, each of them by hypothesis
equally probable. If, then, we agree to denote certainty by unity,
we must divide 1 into 52 equal parts and assign one part to each
card as the probability of its appearance.

1. The probability (or chance as it is sometimes called) of turning
up any stated card, such as the 7 of spades, is therefore 1 out of 52,
i.e. 1/52.

2. Again, since there are 13 spades in all, the chance of turning
up some spade is 13 out of 52, i.e. 13/52 = 1/4.

These results may be put in another way which is often useful. 
If the experiment is repeated a great number of times, a return to 
the initial conditions of the problem being made after each trial 
by replacing the card drawn and reshuffling the pack, we should 
expect to turn up the 7 of spades on the average about once in 
every 52 experiments, and we should expect to turn up some spade 
on the average about once in every 4 experiments. This must 
not be taken to mean that in 4 experiments we are sure to turn
up just one spade (a trial will readily prove such a statement to
be untrue), but that, if we went on performing experiment after
experiment, we should in the long run get a proportion of about
1 spade to every 4 experiments, and a trial will likewise prove the
truth of this statement.
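
Such a trial is easily made by machine. In the sketch below the pack is replaced and reshuffled after every draw by simply choosing one of the 52 cards at random each time; the seed, the numbers of trials, and the function name are arbitrary choices made for illustration.

```python
import random

random.seed(7)  # fixed seed so the sketch is repeatable

SUITS = ["spades", "hearts", "diamonds", "clubs"]
RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
PACK = [(rank, suit) for suit in SUITS for rank in RANKS]  # 52 cards


def proportion_of_spades(trials):
    """Turn up one card from a full, reshuffled pack, `trials` times over."""
    spades = sum(random.choice(PACK)[1] == "spades" for _ in range(trials))
    return spades / trials


for trials in (4, 40, 400, 40_000):
    share = proportion_of_spades(trials)
    print(f"{trials:>6} trials: proportion of spades = {share:.3f}")
```

With only 4 trials the proportion may be anything from 0 to 1; as the number of trials grows it should settle down near 1/4.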

Generally, when an event can happen in n different ways
altogether, and among these different ways there are a which give
what might be called successful events, the probability of success
at any single happening is a out of n, i.e. a/n, and is usually denoted
by the letter p, and the probability of failure is (n-a) out of n,
i.e. (n-a)/n, and is usually denoted by the letter q.

Clearly (p+q) = 1, and this is reasonable because we are certain
to get either a success or a failure at a single trial and unity was
fixed as the measure of certainty. In k trials, the probable number
of successes would be kp and of failures kq, because in n trials, on
the average, there are a, or np, successes and (n-a), or nq, failures.

Example (1).—In the second case considered above, the
probability of success (turning up a spade) is a out of n

= a/n = 13/52 = 1/4 = p,

and the probability of failure (not turning up a spade, i.e. turning
up one of 39 other cards) is (n-a) out of n

= (n-a)/n = 39/52 = 3/4 = q.

And (p+q) = 1/4 + 3/4 = 1.

Example (2).—What is the chance of drawing either a picture
card or an ace from the pack at a single trial?

Altogether there are 12 picture cards, and the chance of drawing
any one of them is thus 12 out of 52

= 12/52 = 3/13;

and the chance of drawing any one of the 4 aces is 4 out of 52

= 4/52 = 1/13.

Hence the total probability required

= 3/13 + 1/13 = 4/13.

Generally, if the probability of one type of event is $p_1$ and the
probability of a second type of event is $p_2$, and if either type is
reckoned a success, then the total probability of success is $(p_1+p_2)$.
This evidently holds good however many different types there
may be, and even if there is only one event of each type.
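
The card results so far can be confirmed by direct counting. The sketch below builds the 52 cards and applies the definition a/n with exact fractions; the helper name `probability` and the representation of a card as a (rank, suit) pair are illustrative choices.

```python
from fractions import Fraction

SUITS = ["spades", "hearts", "diamonds", "clubs"]
RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
PACK = [(rank, suit) for suit in SUITS for rank in RANKS]


def probability(is_success):
    """a/n: the successful cards out of all the equally likely cards."""
    a = sum(1 for card in PACK if is_success(card))
    return Fraction(a, len(PACK))


p_seven_of_spades = probability(lambda c: c == ("7", "spades"))
p_spade = probability(lambda c: c[1] == "spades")
p_picture = probability(lambda c: c[0] in {"J", "Q", "K"})
p_ace = probability(lambda c: c[0] == "A")
p_picture_or_ace = probability(lambda c: c[0] in {"J", "Q", "K", "A"})

print(p_seven_of_spades, p_spade, 1 - p_spade)   # 1/52 1/4 3/4
print(p_picture, p_ace, p_picture_or_ace)        # 3/13 1/13 4/13
print(p_picture + p_ace == p_picture_or_ace)     # True
```

The simple addition works here because no card is at once a picture card and an ace, so that the two types of success never overlap.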

Consider now the simultaneous happening of two events, one of
which can happen in n different ways, a among which are to be
regarded as successful, and the second can happen in n' different
ways, a' among which are to be regarded as successful. Further,
the two events are to be absolutely independent of one another
in the sense that neither is to influence the success or failure of
the other. What is the probability of a double success occurring?

The total number of different combinations of the two events
possible is nn', for any one of the n possible happenings for the
first event can be combined with any one of the n' possible
happenings for the second event. Also the total number of different
combinations of two successes possible is aa', for any one of the
a possible successes for the first event can be combined with any
one of the a' possible successes for the second event. Hence,
according to our definition of probability, the probability of a double
success is aa' out of nn' = aa'/nn' = (a/n)(a'/n').

Thus to get the probability of a double success for a combination
of two independent events we must multiply together the separate
probabilities for the success of each event taken by itself.

Similarly, in the above case, the probability of a double failure
= (n-a)(n'-a')/nn'; and the probability of one success and one
failure

$$= \frac{a}{n}\cdot\frac{n'-a'}{n'} + \frac{n-a}{n}\cdot\frac{a'}{n'},$$

for the first event can be a success and the second a failure or the
first a failure and the second a success.

Here, again, if we take all the different possibilities into account,
and add the probabilities corresponding to each case, we arrive
at certainty, the measure of which is unity, thus:

probability of 2 successes = aa'/nn',
probability of 1 success and 1 failure = a(n'-a')/nn' + a'(n-a)/nn',
probability of 2 failures = (n-a)(n'-a')/nn'.

Therefore total probability, all cases,

$$= \frac{aa'}{nn'} + \frac{a(n'-a')}{nn'} + \frac{a'(n-a)}{nn'} + \frac{(n-a)(n'-a')}{nn'}$$

$$= (aa' + an' - aa' + a'n - a'a + nn' - na' - an' + aa')/nn'$$

$$= nn'/nn'$$

$$= 1.$$
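
The same conclusion can be reached by brute enumeration of the nn' combinations. The sketch below is illustrative only: the function name and the figures chosen (4 successful ways out of 52 for the first event, 12 out of 52 for the second) are assumptions made for the purpose.

```python
from fractions import Fraction
from itertools import product


def joint_probabilities(a, n, a2, n2):
    """Enumerate all n*n2 combinations of two independent events.

    The first event has `a` successful ways out of `n`, the second
    `a2` out of `n2`. Returns the probabilities of 2, 1 and 0 successes.
    """
    first = [True] * a + [False] * (n - a)
    second = [True] * a2 + [False] * (n2 - a2)
    counts = {2: 0, 1: 0, 0: 0}
    for s1, s2 in product(first, second):
        counts[s1 + s2] += 1  # True counts as 1, False as 0
    total = n * n2
    return {k: Fraction(v, total) for k, v in counts.items()}


# Illustrative figures: 4 ways of success out of 52 for the first
# event and 12 ways of success out of 52 for the second.
probs = joint_probabilities(4, 52, 12, 52)
print(probs[2], probs[1], probs[0], sum(probs.values()))
```

The probability of a double success agrees with the product (a/n)(a'/n'), and the three cases together make up unity.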

Example.—Take two packs of cards. What is the probability
of drawing an ace from the first pack and a king, queen, or knave
from the second pack?

Here a = 4, n = 52, a' = 12, n' = 52; hence the required probability

= aa'/nn' = 4/52 x 12/52 = 3/169 = 1/56⅓.

Thus we might expect to succeed on the average about once in
56 trials.

We proceed to discuss the case of a coin spun a number of times
in succession, and we shall find the probabilities of the appearance
of so many heads (H) and so many tails (T) in so many spins on the
hypothesis that the coin is perfectly balanced and equally likely
to fall on either side.

In 1 spin there are 2 possible events, namely H or T, which
we shall write simply as

(H, T).

In 2 spins there are 4 possible events, because we can combine
the H or T of the first with an H or T at the second spin, and we
may express the result thus

(H, T)(H, T) = (HH, HT, TH, TT);

the interpretation of which is that we may get either head followed
by head, or head followed by tail, or tail followed by head, or tail
followed by tail.

In 3 spins there are 8 possible events, because we can combine
the 4 events previously possible with an H or T at the third spin,
thus getting

(H, T)(H, T)(H, T)

= (H, T)(HH, HT, TH, TT)

= (HHH, HHT, HTH, HTT, THH, THT, TTH, TTT);

the interpretation of which is that we may get either 3 heads in
succession, or 2 heads followed by 1 tail, or head followed by tail
followed by head, and so on.

In 4 spins there are 16 possible events, because we can combine
the 8 events previously possible with an H or T at the fourth spin,
thus

(H, T)(HHH, HHT, HTH, HTT, THH, THT, TTH, TTT)

= (HHHH, HHHT, HHTH, HHTT, HTHH, HTHT,
HTTH, HTTT, THHH, THHT, THTH, THTT,
TTHH, TTHT, TTTH, TTTT).
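
The lists above can be generated mechanically. The sketch below uses `itertools.product` from the Python standard library to combine an H or T at each spin with everything that has gone before; the function name is illustrative.

```python
from itertools import product


def possible_events(spins):
    """All ordered results of `spins` spins of a coin, e.g. 'HTH'."""
    return ["".join(result) for result in product("HT", repeat=spins)]


for spins in range(1, 5):
    events = possible_events(spins)
    print(spins, len(events), events)
```

Each additional spin doubles the length of the list, giving 2, 4, 8 and 16 events in turn.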

But the method here adopted to get the possible events at each
stage is precisely the same as that which gives the successive terms
in the ordinary algebraical expansions of

(H+T), (H+T)(H+T), (H+T)(H+T)(H+T), etc.

Also each new spin has the effect of doubling the number of possible
events obtained at the previous spin, and we conclude that in
n spins, there are

(2 x 2 x 2 x . . . to n factors),

or $2^n$, possible events, and these events are given by the successive
terms in the expansion of

[(H+T)(H+T)(H+T) . . . to n factors.]

Let us now consider the probabilities of the different events
obtainable. The important point to notice is that at any stage
each possible event has exactly the same probability, for there is
no reason why any particular spin should give H rather than T,
or T rather than H: for example, in 3 spins there are 8 possible
events, each by itself equally probable, and we therefore divide
the unity of certainty into 8 equal parts and assign one part to each
event, thus

probability of 3 heads: HHH = 1/8;
probability of 2 heads and 1 tail: HHT = 1/8, HTH = 1/8, THH = 1/8, together 3/8;
probability of 1 head and 2 tails: HTT = 1/8, THT = 1/8, TTH = 1/8, together 3/8;
probability of 3 tails: TTT = 1/8.

It is clear from this arrangement that, if the order of the
appearance of H and T is indifferent, some events are of the same type
and some types are likely to appear oftener than others, e.g. the
probability of getting ‘2 heads and 1 tail’ (or ‘1 head and 2 tails’)
is three times as great as the probability of getting ‘3 heads’
or ‘3 tails.’ Hence for conciseness it is convenient to adopt the
ordinary index notation and write

$HHH = H^3$, $HHT = H^2T$, $HTH = H^2T$, etc.,

so that the possible events in 3 spins are

$H^3,\ 3H^2T,\ 3HT^2,\ T^3$;

in 4 spins they are

$H^4,\ 4H^3T,\ 6H^2T^2,\ 4HT^3,\ T^4$;

and so on.

The probability of any particular type is now readily written
down: e.g. in 4 spins, the probability of getting 2 heads and 2 tails

= (number of successful events possible)/(total number of events
possible)

$= 6/2^4 = 6/16 = 3/8$.

But the binomial expansion always sums together terms of the
same type for us in just the manner wanted, and we have the
possible events in n spins given by the successive terms in the
expansion of

(H+T)(H+T)(H+T) . . . to n factors,

i.e. $(H+T)^n$,

i.e. $H^n + {}^nC_1H^{n-1}T + {}^nC_2H^{n-2}T^2 + \ldots + T^n$,

and therefore again the probability of any particular combination
is readily written down: e.g. probability of ‘(n-2) heads, 2 tails’

= (number of successful events possible)/(total number of events
possible)

$= {}^nC_2/2^n$.

Another way of stating the result obtained is to say that we
might expect to get

n heads appearing on the average about once in every $2^n$ trials,
(n-1) heads, 1 tail appearing on the average about ${}^nC_1$ times in every $2^n$ trials,
(n-2) heads, 2 tails appearing on the average about ${}^nC_2$ times in every $2^n$ trials,

and so on.
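
The grouping of events by type may be verified by counting. The sketch below tallies the ordered events by the number of tails they contain and sets the tally beside the binomial coefficients given by `math.comb`; the function name is illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
from math import comb


def type_counts(spins):
    """Number of ordered events containing 0, 1, ..., `spins` tails."""
    tally = Counter(result.count("T") for result in product("HT", repeat=spins))
    return [tally[tails] for tails in range(spins + 1)]


n = 4
counts = type_counts(n)
print(counts)                                # [1, 4, 6, 4, 1]
print([comb(n, r) for r in range(n + 1)])    # [1, 4, 6, 4, 1]
print(Fraction(counts[2], 2 ** n))           # 3/8, two heads and two tails
```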

If, in accord with our previous notation, we call the appearance
of, say, H at any spin a ‘success,’ and label its probability by the
letter p, and if consequently the appearance of T at any spin is a
‘failure,’ its probability to be labelled by the letter q, we have the
probabilities of the different combinations of events in $(H+T)^n$, or

$$H^n + {}^nC_1H^{n-1}T + {}^nC_2H^{n-2}T^2 + \ldots + T^n,$$

given by the corresponding terms in $(p+q)^n$, or

$$p^n + {}^nC_1p^{n-1}q + {}^nC_2p^{n-2}q^2 + \ldots + q^n,$$

where $p = q = 1/2$.

After each spin of the coin in the case considered the distribution
of probabilities was symmetrical, e.g. after the fourth spin the
probabilities were

1/16, 4/16, 6/16, 4/16, 1/16.

We pass on now to a case where the distribution is not symmetrical,
owing to the fact that p and q are no longer equal for any isolated
event.

Consider the throw of an ordinary die in which each of the six
faces is assumed to have an equal chance of appearing uppermost.
The probability of throwing, say, a 3 is 1/6, since we are certain
to throw either 1, 2, 3, 4, 5, or 6; and the probability of failing to
throw a 3 is 5/6, since we are certain either to throw a 3 or not
to throw a 3.

If we represent the probability of success (say, in this case,
throwing a 3) by p (i.e. 1/6), and failure (i.e. in this case, failing
to throw a 3) by q (i.e. 5/6), we have

p + q = 1/6 + 5/6 = 1.

Bearing in mind then that the probability for a combination of two
independent events is determined by multiplying together the
separate probabilities for each, we have the following table showing
what might be expected when 1, 2, or 3 dice are thrown up together,
where s stands for success and f for failure:


No. of Dice thrown.   Different Possibilities.                   Different Probabilities.

1                     s, f                                       p, q
2                     ss, sf, fs, ff                             pp, pq, qp, qq
3                     sss, ssf, sfs, sff, fss, fsf, ffs, fff     ppp, ppq, pqp, pqq, qpp, qpq, qqp, qqq


The table is easily extended on the same principle, and at each
step, it will be noticed, a fresh pair of possibilities, s or f, is
introduced, with corresponding p or q, to be combined with what has
gone before.

If the order of appearance of s and f is a matter of indifference,
e.g. if it does not matter whether the first die shows s and the
second f, or vice versa, so that results of the type sff and fsf may
be regarded as equivalent, we may use the index notation, as in
the coin case, to render the table more concise, thus:



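No. of Dice thrown.   Different Possibilities.          Corresponding Probabilities.

1                     $s,\ f$                           $p,\ q$
2                     $s^2,\ 2sf,\ f^2$                 $p^2,\ 2pq,\ q^2$
3                     $s^3,\ 3s^2f,\ 3sf^2,\ f^3$       $p^3,\ 3p^2q,\ 3pq^2,\ q^3$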



When, therefore, n dice are thrown we again recognize the
different possibilities as given by the successive terms in the
expansion of $(s+f)^n$, namely

$$s^n + {}^nC_1s^{n-1}f + {}^nC_2s^{n-2}f^2 + \ldots + f^n,$$

and the corresponding probabilities by the successive terms in the
expansion of $(p+q)^n$, namely

$$p^n + {}^nC_1p^{n-1}q + {}^nC_2p^{n-2}q^2 + \ldots + q^n.$$

Hence the probability of throwing n threes $= p^n = 1/6^n$;

the probability of throwing (n-1) threes

$$= {}^nC_1\,p^{n-1}q = n\cdot\frac{1}{6^{n-1}}\cdot\frac{5}{6} = 5n/6^n;$$

the probability of throwing (n-2) threes

$$= {}^nC_2\,p^{n-2}q^2 = \frac{n(n-1)}{1\cdot 2}\cdot\frac{1}{6^{n-2}}\cdot\frac{5^2}{6^2} = 25n(n-1)/(2\cdot 6^n);$$

and so on.
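
These expressions are readily checked for any particular n. The sketch below evaluates the general binomial term with exact fractions and compares it with the three closed forms just obtained; the function name and the choice n = 5 are illustrative.

```python
from fractions import Fraction
from math import comb


def prob_threes(n, threes, p=Fraction(1, 6)):
    """Probability of exactly `threes` threes when n dice are thrown."""
    q = 1 - p
    return comb(n, threes) * p ** threes * q ** (n - threes)


n = 5
print(prob_threes(n, n) == Fraction(1, 6 ** n))                         # True
print(prob_threes(n, n - 1) == Fraction(5 * n, 6 ** n))                 # True
print(prob_threes(n, n - 2) == Fraction(25 * n * (n - 1), 2 * 6 ** n))  # True
print(sum(prob_threes(n, k) for k in range(n + 1)))                     # 1
```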

The result we have just obtained is of perfectly general
application. Whether we spin n coins, in which the probability, p, of
success (say ‘heads’) for each is 1/2, or throw n dice, in which the
probability, p, of success (say ‘to get a 3’) for each is 1/6, or have
any n similar but independent events happening in which the
probability of success for each is p, the different resulting
possibilities as to success are given by the successive terms in the
expansion of $(s+f)^n$, and their corresponding probabilities are given by
the successive terms in the expansion of $(p+q)^n$.

We are thus in a position to form a frequency table, like that on
p. 63, showing the probabilities of getting 0, 1, 2 . . . n successes
(in other words, the proportional frequencies of these different
numbers of successes) at the occurrence of n similar independent
events, where p is the probability of success for each and q is the
probability of failure:

Table (36). Binomial Distribution.

(1) Number of Successes (x); (2) Frequency (f); (3) Product of Nos. in Cols. (1) & (2) (fx); (4) Product of Nos. in Cols. (1) & (3) (fx²).

(1)     (2)                                            (3)                                         (4)

0       $q^n$                                          $0$                                         $0$
1       $nq^{n-1}p$                                    $nq^{n-1}p$                                 $nq^{n-1}p$
2       $\frac{n(n-1)}{1\cdot2}q^{n-2}p^2$             $n(n-1)q^{n-2}p^2$                          $2n(n-1)q^{n-2}p^2$
3       $\frac{n(n-1)(n-2)}{1\cdot2\cdot3}q^{n-3}p^3$  $\frac{n(n-1)(n-2)}{1\cdot2}q^{n-3}p^3$     $\frac{3n(n-1)(n-2)}{1\cdot2}q^{n-3}p^3$
. . .   . . .                                          . . .                                       . . .
n       $p^n$                                          $np^n$                                      $n^2p^n$

Col. (1) gives the deviations from the origin of measurement,
which in this case is taken as ‘no successes,’ the class interval
being equal to a difference of 1 in the number of successes.

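The summations of the last three columns are effected as
follows:

Col. (2).

$$q^n + nq^{n-1}p + \frac{n(n-1)}{1\cdot2}q^{n-2}p^2 + \ldots + p^n = (q+p)^n = 1,$$

because $p+q=1$.

Col. (3).

$$nq^{n-1}p + n(n-1)q^{n-2}p^2 + \frac{n(n-1)(n-2)}{1\cdot2}q^{n-3}p^3 + \ldots + np^n$$

$$= np\left[q^{n-1} + (n-1)q^{n-2}p + \frac{(n-1)(n-2)}{1\cdot2}q^{n-3}p^2 + \ldots + p^{n-1}\right]$$

$$= np(q+p)^{n-1}$$

$$= np.$$

Col. (4).

$$nq^{n-1}p + 2n(n-1)q^{n-2}p^2 + \frac{3n(n-1)(n-2)}{1\cdot2}q^{n-3}p^3 + \ldots + n^2p^n$$

$$= np\left[q^{n-1} + 2(n-1)q^{n-2}p + \frac{3(n-1)(n-2)}{1\cdot2}q^{n-3}p^2 + \ldots + np^{n-1}\right]$$

$$= np\left[\left\{q^{n-1} + (n-1)q^{n-2}p + \frac{(n-1)(n-2)}{1\cdot2}q^{n-3}p^2 + \ldots + p^{n-1}\right\} + \left\{(n-1)q^{n-2}p + \frac{2(n-1)(n-2)}{1\cdot2}q^{n-3}p^2 + \ldots + (n-1)p^{n-1}\right\}\right]$$

$$= np\left[(q+p)^{n-1} + (n-1)p\left\{q^{n-2} + (n-2)q^{n-3}p + \ldots + p^{n-2}\right\}\right]$$

$$= np\left[1 + (n-1)p(q+p)^{n-2}\right]$$

$$= np[1 + p(n-1)].$$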


The arithmetic mean of the distribution

= sum of terms in col. (3)/sum of terms in col. (2)

$$= \Sigma(fx)/\Sigma(f) = np.$$

The mean-square deviation referred to zero as origin, zero in this
case corresponding to ‘no successes,’

= sum of terms in col. (4)/sum of terms in col. (2)

$$= \Sigma(fx^2)/\Sigma(f) = np[1 + p(n-1)].$$

Thus the standard deviation, $\sigma$, is given by

$$\sigma^2 = np[1 + p(n-1)] - \bar{x}^2,$$

where $\bar{x}$ is the deviation of the mean from the origin of
measurement, so that $\bar{x} = np$.

Therefore

$$\sigma^2 = np[1 + p(n-1)] - n^2p^2$$

$$= np(1-p) + n^2p^2 - n^2p^2$$

$$= npq.$$

Hence $\sigma = \sqrt{npq}$,

and p.e. $= 0.6745\sqrt{npq}$.
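
The summations can be confirmed for particular values by adding up the columns of the table directly. The sketch below does so with exact fractions; the function name and the values n = 12, p = 1/6 are chosen merely for illustration.

```python
from fractions import Fraction
from math import comb


def binomial_columns(n, p):
    """Sums of columns (2), (3) and (4) of the binomial frequency table."""
    q = 1 - p
    f = [comb(n, x) * q ** (n - x) * p ** x for x in range(n + 1)]
    sum_f = sum(f)
    sum_fx = sum(x * fx for x, fx in enumerate(f))
    sum_fx2 = sum(x * x * fx for x, fx in enumerate(f))
    return sum_f, sum_fx, sum_fx2


n, p = 12, Fraction(1, 6)
q = 1 - p
sum_f, sum_fx, sum_fx2 = binomial_columns(n, p)

mean = sum_fx / sum_f
variance = sum_fx2 / sum_f - mean ** 2

print(sum_f)                                   # 1
print(mean, n * p)                             # 2 2
print(sum_fx2 == n * p * (1 + p * (n - 1)))    # True
print(variance, n * p * q)                     # 5/3 5/3
```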

These two results are exceedingly important, and it is essential 
to understand what it is they measure. An example may help 
to make this clear. 

If we spin 300 coins, counting ‘head’ for each a success, the
number of heads we shall get will be unlikely to differ very greatly
from the average or mean number of successes, np, i.e. 150 if p = 1/2
for each coin, and in the long run, if we repeat the experiment a
great number of times, we shall get a proportion of about 150 heads
to every one experiment. Again, if we throw 300 dice, counting
every throw of the number 5, say, for each die a success, so that
p in this case = 1/6, the number of fives we shall get will be unlikely
to differ much from np, i.e. 50, and in the long run, if we repeat the
experiment a great number of times, we shall get on the average
a proportion of about 50 fives to every experiment; we should
find, for example, something like 5000 fives if we threw 300 dice
100 times in succession. The arithmetic mean of the distribution
tells us therefore about what number of successes to expect in one
experiment with n events if n is fairly large, though we should be
unlikely to get exactly this number if we confined ourselves to the
one experiment.

The second result, the S.D., supplies us with a measure of the
unlikelihood of getting the exact number of successes expected at
any single experiment, for it defines the dispersion of the different
numbers of possible successes about their average. Clearly the
greater the dispersion, the greater is the likelihood of missing the
average. The mean number of successes when an experiment is
repeated a great number of times is np, but at any single experiment
it is not unlikely that the number of successes obtained may
differ from np by as much as $0.6745\sqrt{npq}$ in excess or in defect;
it is, however, unlikely, as we shall see later (p. 244), that the
number will differ from np by more than $3\sqrt{npq}$ in excess or
defect when the distribution is not very skew, or unsymmetrical,
especially if n be large. The probable error in the case above when
we throw a sample of 300 dice is

$=0.6745\sqrt{300\times 1/6\times 5/6}=0.6745\sqrt{41.67}=4.4$,

and it is therefore quite likely that the number of fives obtained
at one experiment will differ from the expected number, 50, by as
much as 4 or 5 in excess or defect, but it is unlikely that the number
will fall outside the limits $50\pm 3\sqrt{41.67}$, say 30 to 70.
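
The arithmetic of the dice illustration can be checked mechanically. The following is only a sketch in Python; the function name `binomial_summary` is ours and is not part of the text. It evaluates np, the S.D. and the probable error for 300 dice with p = 1/6, together with the limits of three times the S.D. on either side of the mean.

```python
import math

def binomial_summary(n, p):
    """Mean, S.D. and probable error of the number of successes in n events."""
    q = 1.0 - p
    mean = n * p
    sd = math.sqrt(n * p * q)
    pe = 0.6745 * sd  # probable error = 0.6745 x S.D.
    return mean, sd, pe

# 300 dice, a five counting as a success.
mean, sd, pe = binomial_summary(300, 1 / 6)
lower, upper = mean - 3 * sd, mean + 3 * sd

print(f"expected fives = {mean:.0f}, S.D. = {sd:.2f}, p.e. = {pe:.1f}")
print(f"limits of 3 x S.D.: {lower:.1f} to {upper:.1f}")
```

The limits come out a little inside 30 and 70, which is why the text rounds them outwards to those figures.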

It is sometimes more convenient to refer to the proportion of
successes, etc., expected at any experiment rather than to the
actual number expected. In that case, since with n events the
expected number of successes is pn, but the number obtained may
quite likely differ from this by $\pm 0.6745\sqrt{npq}$, therefore with
n events the expected proportion of successes is pn/n, i.e. p, with
quite possibly an error $=\pm 0.6745\sqrt{npq}/n$, i.e. $\pm 0.6745\sqrt{pq/n}$.

Thus, with the 300 dice, the expected proportion of successes at
one experiment lies between

$[1/6-0.6745\sqrt{(1/6\times 5/6)\div 300}]$ and $[1/6+0.6745\sqrt{(1/6\times 5/6)\div 300}]$,
i.e. $(1/6-0.6745/46.5)$ and $(1/6+0.6745/46.5)$,

i.e. 1/6.6 and 1/5.5;

and it is unlikely that the proportion will differ from 1/6 by more
than 3/46.5, i.e. 1/15.5.
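
The same figures may be had for the proportion instead of the number. The lines below are a minimal sketch, assuming only the formula $0.6745\sqrt{pq/n}$ stated above; the function name is ours. Results are printed as reciprocals so that they can be set beside the fractions in the text.

```python
import math

def proportion_probable_error(n, p):
    """Probable error of the proportion of successes in n events."""
    return 0.6745 * math.sqrt(p * (1.0 - p) / n)

n, p = 300, 1 / 6
sd = math.sqrt(p * (1.0 - p) / n)   # S.D. of the proportion
pe = proportion_probable_error(n, p)

print(f"S.D. of the proportion = 1/{1 / sd:.1f}")
print(f"p.e. limits: 1/{1 / (p - pe):.1f} to 1/{1 / (p + pe):.1f}")
print(f"3 x S.D. = 1/{1 / (3 * sd):.1f}")
```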

To illustrate how the binomial distribution might be directly
applied, an experiment was made with 900 digits selected at random
by taking in succession the digits in the seventh decimal place in
the logarithms of the following numbers:—

10054, 10154, 10254, . . . 99954,

as given in Chambers’s Mathematical Tables. In this way each of
the 10 digits, 0, 1, 2, 3 ... 9, may be supposed to have stood an
equal chance of selection each time one was written down. Gaps
of 100 were left between the numbers selected so as to avoid runs
of the same figure which sometimes occur even in the seventh
decimal place owing to lack of independence.

The digits were arranged in 36 columns, each column containing
25 digits, and in this way we obtained what was equivalent to
36 separate but like experiments with 25 events each. If we agree
to regard the appearance of a 7 or an 8 as a successful event, and
the appearance of any other digit as a failure, the chance of success
at any appearance is 2/10, and the chance of failure is 8/10. The
case is thus of exactly the same kind as that of throwing 25 dice
36 times in succession, and if the probability of success, namely 1/5,
for each independent event, be denoted by p, and the probability
of failure, namely 4/5, by q, the distribution of successes and failures
should approximately conform to that given by the expansion of

$(p+q)^{25}$

for any particular experiment, and since the experiment was repeated
36 times, the total numbers of successes and failures of
different orders obtained should approximately conform to

$36(p+q)^{25}$,

for if the probability of an event is p the number of events to be
expected in N trials is Np.

The actual distribution observed is compared with that given
by the binomial expansion in Table (36). Col. (2) is obtained by
picking out the appropriate terms in the expansion of $36(p+q)^{25}$,
where p = 1/5, q = 4/5; this expansion is

$36\left[q^{25}+25\,q^{24}p+\frac{25\cdot 24}{1\cdot 2}\,q^{23}p^{2}+\dots+p^{25}\right]$.

Thus, 5 successes occur

$36\cdot\frac{25\cdot 24\cdots 6}{1\cdot 2\cdot 3\cdots 20}\,p^{5}q^{20}$

times, and this equals 7.06, or approximately 7.

The mean number of successes by theory = np = 25/5 = 5. The
mean by trial, since it is measured from zero as origin, the numbers
in col. (1) being the deviations,

$=\Sigma(fx)/\Sigma(f)=162/36=4.5$.

The standard deviation by theory

$=\sqrt{npq}=\sqrt{25\times\tfrac{1}{5}\times\tfrac{4}{5}}=2$.

Table (36). Distribution of Successes (getting a 7 or 8) in
the Random Choice of 25 Digits 36 Times in Succession.

    (1)           (2)            (3)             (4)               (5)
  No. of       Frequency      Frequency      Product of        Product of
 Successes.       by             by           Nos. in           Nos. in
              Calculation.   Experiment.   Cols. (1) & (3).  Cols. (1) & (4).
    (x)                          (f)            (fx)              (fx²)

     1             1              1               1                 1
     2             3              5              10                20
     3             5              5              15                45
     4             7              7              28               112
     5             7              9              45               225
     6             6              4              24               144
     7             4              3              21               147
     8             2              0               0                 0
     9             1              2              18               162

   Total          36             36             162               856

By trial, the mean square deviation, measured from zero as origin

$=\Sigma fx^2/\Sigma f$

$=856/36$.

Thus the S.D. by trial $=\sqrt{\Sigma fx^2/\Sigma f-\bar{x}^2}$,
where $\bar{x}$ is the deviation of the mean from the origin,

$=\sqrt{856/36-(4.5)^2}$
$=1.88$.

It will be seen that not one of the 36 experiments gave a number 
of successes differing from 5, the theoretical mean, by more than 
twice the S.D., for the number ranges only between 1 and 9. 
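
The comparison in Table (36) can be reproduced directly from the binomial terms. What follows is only a sketch: the observed frequencies are those of col. (3), and the helper `expected_frequency` is a name of our own choosing. It evaluates each term of $36(p+q)^{25}$ with p = 1/5 and then recomputes the mean and S.D. by trial.

```python
import math

N_EXPERIMENTS, N_EVENTS, P = 36, 25, 1 / 5

# Col. (3) of Table (36): number of successes -> number of experiments.
observed = {1: 1, 2: 5, 3: 5, 4: 7, 5: 9, 6: 4, 7: 3, 8: 0, 9: 2}

def expected_frequency(k):
    """Term of 36(p+q)^25 giving the expected number of experiments with k successes."""
    return N_EXPERIMENTS * math.comb(N_EVENTS, k) * P**k * (1 - P) ** (N_EVENTS - k)

total = sum(observed.values())
mean = sum(k * f for k, f in observed.items()) / total
mean_square = sum(k * k * f for k, f in observed.items()) / total
sd = math.sqrt(mean_square - mean**2)

for k in sorted(observed):
    print(f"{k} successes: expected {expected_frequency(k):5.2f}, observed {observed[k]}")
print(f"mean by trial = {mean:.2f}, S.D. by trial = {sd:.2f}")
```

Rounding each expected value to the nearest whole number should recover col. (2) of the table.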

If we treat the 900 digits as 900 separate experiments with one 
event each, instead of treating them as 36 experiments containing 
25 events each, we have 1/10 as the chance for the appearance of 
any particular digit, and hence the number of times any digit may 
be expected to appear approximately

$=(900)\tfrac{1}{10}\pm\tfrac{2}{3}\sqrt{900\times\tfrac{1}{10}\times\tfrac{9}{10}}$

$=90\pm 6$.


The actual number of occurrences of each digit was as follows:

Digit . . . . . . . .     0    1    2    3    4    5    6    7    8    9
No. of Occurrences . .   95   96   93  105   91   80   82   72   90   96

so that the digit 7 showed the greatest divergence from 90 of any,
and this was only just three times the probable error.
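
The check on the ten digits can be set out as follows. This is a sketch only, using the observed counts as they also appear in Table (37); it works with the factor 0.6745 rather than the rounded 2/3, so the probable error comes out slightly above 6.

```python
import math

# Occurrences of the digits 0..9 among the 900 random digits.
counts = [95, 96, 93, 105, 91, 80, 82, 72, 90, 96]

n = sum(counts)
p = 1 / 10
expected = n * p
pe = 0.6745 * math.sqrt(n * p * (1 - p))

for digit, count in enumerate(counts):
    ratio = (count - expected) / pe
    print(f"digit {digit}: {count:3d}  deviation = {count - expected:+.0f}  ({ratio:+.2f} x p.e.)")

# The digit whose count lies farthest from the expected 90.
worst = max(range(10), key=lambda d: abs(counts[d] - expected))
print(f"greatest divergence: digit {worst}")
```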

[The Theory of Probability is older than that of Statistics. Todhunter, in
his History, states that ‘writers on the subject have shown a justifiable pride
in connecting its true origin with the great name of Pascal.’ The well-known
story of the latter being found, as a lad of twelve, tracing out on the hall floor
geometrical propositions which he had evolved in his own head is not to be
wondered at, nor yet that at sixteen he wrote a small work on Conic Sections,
when one reflects upon the fame he was to win as a philosopher and writer,
as well as a mathematician, in his too brief life of thirty-nine years. He was
born in 1623 of a distinguished French family, and for the last half of his
life he suffered from the effects of a serious disease which contributed to turn
his attention from mathematics to religion and philosophy.

We learn from Todhunter how a certain gentleman of repute at the gaming
tables set Pascal pondering on a question of probability concerning the fair
division of stakes between two players who give up their game before its
conclusion—an old problem cited in a work by Luca Pacioli as early as 1494. A
correspondence followed between him and Fermat, then probably the two most
distinguished mathematicians in Europe, and so began a science which has
fascinated at one time or another all great mathematicians from that day to
this.

The illustrious family of the Bernoullis, friends of Leibnitz, who championed
his claim against that made by English mathematicians on behalf of Newton
to the invention of the Calculus; De Moivre, an exile in England, owing to
the revocation of the Edict of Nantes; Euler, Lagrange, and Laplace, who
worked out in algebraical form Newton's theory of gravitation for the motion
of the planets—all these had a share in building up the science of Probability,
often by investigating problems in games of chance, where the conditions can
be made mathematically perfect, so by careful analysis preparing the way for
the use later of the same principles in matters of greater importance.

It has been said that the development of the subject owes more to Laplace
(1749-1827) than to any other mathematician; nor did he confine himself to
its theory: he would have earned fame by his astronomical applications alone.
His method was to take certain observations, and to determine by means of
probability whether the abnormalities present were merely the results of chance
or whether there was some as yet undiscovered but constantly acting cause
behind the phenomena observed. In this way he was led to highly interesting
and important results such as those relating to the theory of the tides, the
effect of the spheroidal shape of the earth on the motion of the moon, the
irregularities of Jupiter and Saturn, and the laws which govern the motion
of Jupiter’s moons. It needs but a step in thought to pass from the
discussion of such physical data to the statistics of social phenomena and the
causes which determine abnormalities met with in that field. Professor
Edgeworth, in making reference to books that have been written on Probability at
the end of his excellent article under that heading in the Encyclopædia
Britannica, remarks that ‘as a comprehensive and masterly treatment of
the subject as a whole, in its philosophical as well as mathematical character,
there is nothing similar or second to Laplace’s Théorie analytique des
probabilités.’]


CHAPTER XIII

SAMPLING (continued)—FORMULÆ FOR PROBABLE ERRORS

So far we have only considered the most simple case of random
sampling when we take a sample of n independent events each of
which falls into one of two classes according to its nature, the
chance of entering either class being the same for every event:
we have dealt, that is to say, more particularly with non-measurable
characters. We pass on now to measurable characters which are
distributed among several classes according to their size, so that a
frequency distribution table can be set up for each sample; and
assuming that the population from which the samples are drawn is
homogeneous, the samples themselves containing each an adequate
number of individuals, there should not be greater differences between
one table and another than can be accounted for by random sampling.
It is our object to discover how great such differences may be.

Given a homogeneous population of N individuals which we will
suppose could be distributed into a number of groups, $Y_1$ individuals
in the first group, $Y_2$ in the second group, $Y_3$ in the third, and so
on, according to the size of the organ or character under observation.
Suppose a random sample of n individuals be taken from this
population, and when they are assigned to their several groups let the
frequency table now take the form shown, with $y_1$ individuals in the
first group, $y_2$ in the second, and so on. To find the probable error
of $y_k$, the frequency observed in the kth group.

GENERAL POPULATION                    SAMPLE

Class.          Frequency.            Class.          Frequency.
1st Group       $Y_1$                 1st Group       $y_1$
2nd Group       $Y_2$                 2nd Group       $y_2$
. .             . .                   . .             . .
kth Group       $Y_k$                 kth Group       $y_k$
. .             . .                   . .             . .
                N                                     n

Consider the selection of the n individuals, one by one in succession,
to form the sample. When the first choice is made the probability
that we shall get an individual falling into the kth group is, by
definition, $Y_k/N$, and the probability will remain practically the same for
each successive choice granted that N is considerable. We have thus
n independent events, the chance of success (falling into the kth
group) for each being $p\,(=Y_k/N)$ and the chance of failure being
$q\,(=1-Y_k/N)$. The case is therefore analogous to the one previously
considered to which the binomial distribution is applicable, so that
the frequency to be expected in the kth group is np with S.D.
$\sqrt{npq}$; i.e. $y_k=np$ with a p.e. $=0.6745\sqrt{npq}$.

Now in practice the numbers $Y_1, Y_2, Y_3, \dots$ would not be known,
and hence the true value of p would also be unknown, but since
$y_k=np$, approximately, when the sample is of adequate size, we
shall get a fair idea of the probable error involved by taking
$p=y_k/n$, where $y_k$ is the actual frequency observed in the kth group.

Hence, $\sigma_{y_k}^2=n\cdot\frac{y_k}{n}\left(1-\frac{y_k}{n}\right)=y_k\left(1-\frac{y_k}{n}\right)$ . . . (1)

and the frequency in the kth group

$=y_k\pm 0.6745\sqrt{y_k\left(1-\frac{y_k}{n}\right)}$.

The size of the S.D. is under ordinary conditions a test of the
adequacy of the sample, for the frequency in the kth group, if due
simply to random sampling, should not differ from its expected value
by more than $3\sigma_{y_k}$, and $\sigma_{y_k}$ should therefore be small
compared with $y_k$ itself.

To find the correlation between the frequencies in any two
groups of a sample distribution. Let the expected frequencies
in the various groups of the sample be denoted by $y_1, y_2, \dots y_k,
\dots y_s, \dots$, and suppose an error $\delta y_k$ in $y_k$ is associated
with errors $\delta y_1, \delta y_2, \dots \delta y_s, \dots$ in $y_1, y_2, \dots
y_s, \dots$ We require then the correlation between $y_k$ and $y_s$.

Class.          Expected            Observed
                Frequency.          Frequency.
1st Group       $y_1$               $y_1+\delta y_1$
2nd Group       $y_2$               $y_2+\delta y_2$
. .             . .                 . .
kth Group       $y_k$               $y_k+\delta y_k$
. .             . .                 . .
sth Group       $y_s$               $y_s+\delta y_s$
. .             . .                 . .
                n                   n

Now although the group frequencies may change relative to one
another, the total sum of frequencies in all groups is not affected,
because the n individuals of the sample make up its composition in
each case: to keep n constant the group frequencies must adjust
themselves accordingly, which explains the correlation between
them. Hence to compensate for an excess, $\delta y_k$ (assuming $\delta y_k$
positive), of frequency in any one group there must be a defect
$(-\delta y_k)$ shared among the other groups, and the fairest way of
sharing will be in proportion to the expected frequencies in the
several groups.

But the total frequency divided between groups other than the
kth is $(n-y_k)$, so that the proportion of $(-\delta y_k)$ due to the sth
group is $y_s/(n-y_k)$, thus

$\delta y_s=\frac{y_s}{n-y_k}(-\delta y_k)$.

Therefore, $\delta y_k\cdot\delta y_s=-y_s\,\delta y_k^2/(n-y_k)$.

FIRST SAMPLE.

Size of Organ     Frequency of      First Moment.     Second Moment.
or Character      Observations.
observed.
$x_1$             $y_1$             $x_1y_1$          $x_1^2y_1$
$x_2$             $y_2$             $x_2y_2$          $x_2^2y_2$
. .               . .               . .               . .
$x_k$             $y_k$             $x_ky_k$          $x_k^2y_k$
. .               . .               . .               . .
                  n                 $\Sigma(xy)$      $\Sigma(x^2y)$

This gives the product moment of the deviations from $y_k$ and $y_s$
in one particular sample; summing for all such samples, remembering
that by definition the coefficient of correlation between $y_k$
and $y_s$ is $r_{y_ky_s}=\Sigma(\delta y_k\cdot\delta y_s)/(\nu\,\sigma_{y_k}\sigma_{y_s})$,
where $\nu$ is the total number of samples, also
$\sigma_{y_k}^2=\Sigma\delta y_k^2/\nu$, we have

$\nu\,r_{y_ky_s}\sigma_{y_k}\sigma_{y_s}=-\frac{y_s}{n-y_k}\,\nu\,\sigma_{y_k}^2=-\nu\,\frac{y_ky_s}{n}$, by (1) . . . (3)

Therefore, $r_{y_ky_s}=-\sqrt{\frac{y_ky_s}{(n-y_k)(n-y_s)}}$ . . . (4)

gives the correlation required.
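
Results (1) and (4) are easily put to work on a frequency table. The sketch below assumes result (4) in the form $r=-\sqrt{y_ky_s/[(n-y_k)(n-y_s)]}$, which is how we read the derivation above; the function names and the sample figures (a sample of 100 with 20 and 30 individuals in two of its groups) are hypothetical and serve only as illustration.

```python
import math

def group_frequency_pe(y_k, n):
    """Probable error of an observed group frequency y_k in a sample of n, from (1)."""
    return 0.6745 * math.sqrt(y_k * (1 - y_k / n))

def group_correlation(y_k, y_s, n):
    """Correlation between the frequencies in two groups of a sample of n, from (4)."""
    return -math.sqrt((y_k * y_s) / ((n - y_k) * (n - y_s)))

# Hypothetical sample of 100 individuals with 20 and 30 in two of its groups.
print(f"p.e. of a frequency of 20 in 100: {group_frequency_pe(20, 100):.2f}")
print(f"correlation between the two groups: {group_correlation(20, 30, 100):.3f}")

# With only two groups an excess in one is exactly the defect in the other,
# so the correlation should be -1.
print(f"two groups only: {group_correlation(40, 60, 100):.1f}")
```

The negative sign reflects the argument in the text: since n is fixed, an excess in one group must be made good by a defect elsewhere.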

To find the p.e. of the mean of a sample of n observations. Let a
frequency table be drawn up in the usual manner showing the
number of observations $y_1, y_2, \dots$ corresponding to organs of
different sizes $x_1, x_2, \dots$

The mean referred to some fixed point as origin is then given by

$M=(x_1y_1+x_2y_2+\dots)/n$;

also the mean square deviation of the sample referred to the same
fixed point is $\mu_2'$, say, given by

$\mu_2'=(x_1^2y_1+x_2^2y_2+\dots)/n$,

and $\mu_2'-M^2=\sigma^2$,

where $\sigma$ is the S.D. of the sample.

For another sample of the same size the frequency distribution
may be slightly different, say, $y_1+\delta y_1,\ y_2+\delta y_2,\ \dots$, and
consequently the mean will also be different, say,

$M+\delta M=[x_1(y_1+\delta y_1)+x_2(y_2+\delta y_2)+\dots]/n$,

SECOND SAMPLE.

Size of Organ     Frequency of           First Moment.
or Character      Observations.
observed.
$x_1$             $y_1+\delta y_1$       $x_1(y_1+\delta y_1)$
$x_2$             $y_2+\delta y_2$       $x_2(y_2+\delta y_2)$
. .               . .                    . .
$x_k$             $y_k+\delta y_k$       $x_k(y_k+\delta y_k)$
. .               . .                    . .
                  n                      $\Sigma x(y+\delta y)$

and, by subtraction,

$\delta M=(x_1\delta y_1+x_2\delta y_2+\dots)/n$ . . . (5)

Now we want to determine the S.D. of the different values of M
found among the different samples, and that is given by

$\sigma_M^2=\Sigma(\delta M^2)/\nu$,

where $\Sigma$ denotes summation for all samples and $\nu$ is the number of
samples. This suggests that we should square both sides of
equation (5), getting

$n^2\,\delta M^2=x_1^2\,\delta y_1^2+\dots+2x_1x_2\,\delta y_1\delta y_2+\dots$

Therefore, $n^2\,\nu\,\sigma_M^2=x_1^2\,\nu\,\sigma_{y_1}^2+\dots+2x_1x_2\left(-\nu\,\frac{y_1y_2}{n}\right)+\dots$,

by (3). Hence, making use also of (1),

$n^2\sigma_M^2=x_1^2y_1\left(1-\frac{y_1}{n}\right)+\dots-\frac{2x_1y_1\cdot x_2y_2}{n}-\dots$

$=(x_1^2y_1+\dots)-\frac{1}{n}(x_1^2y_1^2+\dots+2x_1y_1\cdot x_2y_2+\dots)$

$=n\mu_2'-\frac{1}{n}(x_1y_1+\dots)^2$

$=n\mu_2'-nM^2$.

Thus $\sigma_M^2=(\mu_2'-M^2)/n=\sigma^2/n$,

and the probable error of the mean $=0.6745\,\sigma/\sqrt{n}$ . . . (6)

The p.e. in the arithmetic mean found by taking a random sample
of n events is a measure, so to speak, of the failure to hit the absolute
mean, and it follows that the precision of the sample, the accuracy
of aim at the mean, would be not unfairly measured by some
quantity proportional to the reciprocal of the above expression,
namely, $\sqrt{n}/(0.6745\,\sigma)$. With such a measure the precision would
evidently be increased if the number of observations in the sample
were increased, being proportional to the square root of their
number.

[* We do not know the true mean for the population as a whole, but we take
in place of it M, the value given by the sample, which we may do with little
error if n is large. Similarly σ is the S.D. of the sample.]

[It is desirable to draw a distinction here between what have been
termed biassed errors and unbiassed errors; errors due to random
sampling are of the second class for there is, by hypothesis, no
reason why they should be in one direction rather than in another.
Biassed errors, however, all tend to be in the same direction and
they may arise in different ways, e.g. they may be due to faults of
omission or commission on the part of the observer himself: he
observes either carelessly or badly, omitting certain factors which
ought to be taken into account, or so measuring or classifying his
results that they appear always larger or less than they really are
in fact.

Sometimes, although the bias is known to exist, it may be impossible
to correct it: the most one can do is to bear it in mind
and allow for it in using the results. A familiar example of this
occurs in the collection of household budgets from the poor to find
their standard of living, where it is only possible to get particulars
from the more intelligent and thrifty class among them.

Whereas in the case of unbiassed errors due to random sampling
we can diminish the probable error of the average by increasing
the number of observations, the same is not true of errors which
are biassed, for suppose an error e in excess be made in each of
n observations $x_1, x_2, \dots x_n$, the effect upon the average is to
increase it from

$\frac{x_1+x_2+\dots}{n}$ to $\frac{(x_1+e)+(x_2+e)+\dots}{n}$,

i.e. from

$\frac{x_1+x_2+\dots}{n}$ to $\frac{x_1+x_2+\dots}{n}+e$,

so that the average is over-estimated by precisely the same amount.
If, therefore, we know that bias exists, it is well, if possible, to
correct it in each observation, for by so doing we change biassed
into unbiassed errors, and though our corrections may be somewhat
wide of the mark, the resultant error will then be diminished by
increasing the number of observations: e.g. a farmer offers 400
sheep for sale and, being anxious to make a good bargain, he asks
a higher figure for them than he is in reality prepared to take;
let us suppose that this excess is 2s. 6d. for each sheep, then clearly
the average price per sheep at which he is prepared to sell will be
less than the amount he asks by 2s. 6d. also. But now suppose the
buyer, a simple person knowing little of the prices of sheep and
less of the ways of men, goes through the flock one by one and
makes the error of offering a price either much above or much below
what the seller is prepared to take; even if his unbiassed offers
differ by as much as 10s. for each sheep from the seller’s reserve
price, so long as they are random in direction, i.e. sometimes too
much and sometimes too little, the resultant difference in the
average from what the seller is prepared to take will probably not
greatly exceed $\tfrac{2}{3}\cdot 10\text{s.}/\sqrt{400}$, or 4d. per sheep.

We can sometimes diminish the effect of bias, even when its
extent is unknown, by working with the ratios of the quantities
affected instead of with the quantities themselves: e.g. suppose
biassed errors, $e_1$ and $e_2$, enter into the measurement of the variables
$X_1$ and $X_2$, both in excess, the ratio of the variables then

$=(X_1+e_1)/(X_2+e_2)$

$=\frac{X_1}{X_2}\left(1+\frac{e_1}{X_1}\right)\left(1+\frac{e_2}{X_2}\right)^{-1}$

$=\frac{X_1}{X_2}\left(1+\frac{e_1}{X_1}-\frac{e_2}{X_2}\right)$, approximately,

if we omit higher powers of $e_1$ and $e_2$ than the first on the
understanding that they are both comparatively small. Suppose, for
example, there was an error of 5 per cent. made in measuring $X_1$
and an error of 3 per cent. of like sign in measuring $X_2$, then the
resulting error in $X_1/X_2$ would be 5 per cent. − 3 per cent. = 2 per cent.
Clearly the same holds good also if the errors are both in defect.
This explains why a comparison of results arranged, say, on the
index number principle may be trustworthy, although the method
of formation of the numbers themselves may be in some respects
faulty, granted that the same faults are repeated each year so as
to produce like errors, i.e. the bias is to be unchanged in character.
To correct the faults in one case and not in the other would prejudice
the success of the method, since it depends upon the errors
counteracting one another.]
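
The two numerical illustrations in the note above, the sheep and the ratio of two biassed measurements, may be verified as follows. This is a sketch under the text's own rules of thumb: the factor 2/3 stands for 0.6745, and both function names are ours.

```python
import math

def unbiased_error_of_average(per_item_error, n):
    """Probable size of the error in an average of n items whose separate
    errors are random in direction: about 2/3 of the error over sqrt(n)."""
    return (2 / 3) * per_item_error / math.sqrt(n)

def ratio_error(e1, e2):
    """Relative error in X1/X2 when X1 and X2 carry relative errors e1 and e2.
    Returns the exact value and the first-order approximation e1 - e2."""
    exact = (1 + e1) / (1 + e2) - 1
    approx = e1 - e2
    return exact, approx

# 400 sheep, offers wrong by 10s. (= 120d.) each way at random.
sheep_error_pence = unbiased_error_of_average(10 * 12, 400)
print(f"error in the average price: about {sheep_error_pence:.0f}d. per sheep")

# Errors of 5 per cent. and 3 per cent. of like sign in X1 and X2.
exact, approx = ratio_error(0.05, 0.03)
print(f"error in the ratio: exact {100 * exact:.2f} per cent., first order {100 * approx:.0f} per cent.")
```

A biassed error of 2s. 6d. on every sheep, by contrast, would leave the average wrong by the full 2s. 6d. however large the flock.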

Example (1).—To illustrate the important result we have obtained
for the p.e. of the mean of n observations let us return to the
experiment of selecting 900 random digits. The distribution actually
obtained, and the theoretical distribution to be expected in the
long run if the experiment were repeated several hundred times and
the average taken, are shown in the following table:—

Table (37). Distribution of 900 Random Digits.

Digit.   Frequency   Theoretical      Digit.   Frequency   Theoretical
         Observed.   Frequency.                Observed.   Frequency.

  0         95           90             5         80           90
  1         96           90             6         82           90
  2         93           90             7         72           90
  3        105           90             8         90           90
  4         91           90             9         96           90


It is a simple matter to calculate the mean and S.D. for the
distribution from this table in the usual way; the results are:—

Observed mean = 4.38; S.D. = 2.911
Theoretical mean = 4.50; S.D. = 2.872.

Thus the p.e. of the mean based on the sample

$=\pm 0.6745\times 2.911/\sqrt{900}$

$=\pm 0.065$,

and 4.38 differs from 4.50 by less than three times the p.e.

The 36 averages of samples of 25 events apiece were also calculated,
and the following were the results obtained:—

2.76, 3.32, 3.68, 3.72, 3.72, 3.72, 3.76, 3.80, 3.92, 3.92, 4.08, 4.12,
4.16, 4.16, 4.16, 4.28, 4.36, 4.40, 4.40, 4.40, 4.44, 4.60, 4.64, 4.68,
4.72, 4.72, 4.76, 4.88, 4.96, 5.00, 5.00, 5.00, 5.08, 5.28, 5.40, 5.72.

The mean of this distribution = 157.72/36 = 4.381, and the
S.D. = 0.612. But the S.D. of the whole distribution of 900 digits
= 2.911, and therefore the S.D. of the distribution of averages of
samples of 25 digits should be $2.911/\sqrt{25}=0.582$, differing from
0.612 by about 5 per cent.
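
The figures of Example (1) can be recomputed from Table (37) and from the list of 36 averages. The sketch below uses only those data; the helper `mean_and_sd` is a name of our own, and the S.D. is taken about the mean with divisor n, as throughout the text.

```python
import math

# Table (37): observed occurrences of the digits 0..9.
observed = [95, 96, 93, 105, 91, 80, 82, 72, 90, 96]

def mean_and_sd(freqs):
    """Total, mean and S.D. of a distribution of digits 0..9 with the given frequencies."""
    n = sum(freqs)
    mean = sum(d * f for d, f in enumerate(freqs)) / n
    mean_square = sum(d * d * f for d, f in enumerate(freqs)) / n
    return n, mean, math.sqrt(mean_square - mean**2)

n, mean, sd = mean_and_sd(observed)
_, t_mean, t_sd = mean_and_sd([90] * 10)   # theoretical distribution
pe_mean = 0.6745 * sd / math.sqrt(n)       # result (6)

# The 36 averages of samples of 25 digits, as listed in the text.
averages = [
    2.76, 3.32, 3.68, 3.72, 3.72, 3.72, 3.76, 3.80, 3.92, 3.92, 4.08, 4.12,
    4.16, 4.16, 4.16, 4.28, 4.36, 4.40, 4.40, 4.40, 4.44, 4.60, 4.64, 4.68,
    4.72, 4.72, 4.76, 4.88, 4.96, 5.00, 5.00, 5.00, 5.08, 5.28, 5.40, 5.72,
]
avg_mean = sum(averages) / len(averages)
avg_sd = math.sqrt(sum((a - avg_mean) ** 2 for a in averages) / len(averages))
predicted_sd = sd / math.sqrt(25)

print(f"observed mean = {mean:.2f}, S.D. = {sd:.3f}")
print(f"theoretical mean = {t_mean:.2f}, S.D. = {t_sd:.3f}")
print(f"p.e. of the mean = {pe_mean:.3f}")
print(f"S.D. of the 36 averages = {avg_sd:.3f}, predicted = {predicted_sd:.3f}")
```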

To find the p.e. of the sum or difference of two variables. Let the
mean values of the two variables be denoted by y and z, so that
deviations from these values found in a particular sample may be
denoted by $\delta y$ and $\delta z$. If then we write

$u=y+z$,

we have

$\delta u=\delta y+\delta z$ . . . (7)

To find the S.D. of u we therefore require $\Sigma(\delta u^2)/\nu$, where the
$\Sigma$ denotes summation for all samples and $\nu$ is the number of samples.

But, squaring both sides of equation (7), we have

$\delta u^2=\delta y^2+\delta z^2+2\,\delta y\,\delta z$.

Thus $\Sigma\delta u^2=\Sigma\delta y^2+\Sigma\delta z^2+2\Sigma(\delta y\,\delta z)$,

where the summation extends to all samples. Hence

$\nu\sigma_u^2=\nu\sigma_y^2+\nu\sigma_z^2+2\nu\,r_{yz}\sigma_y\sigma_z$,

or $\sigma_u^2=\sigma_y^2+\sigma_z^2+2r_{yz}\sigma_y\sigma_z$,

where $r_{yz}$ is the correlation between the variables. And the
p.e. $=0.6745\,\sigma_u$.

The p.e. of the difference of two variables follows at once by
changing the sign of z throughout; for, if

$v=y-z$,

we have $\delta v^2=\delta y^2+\delta z^2-2\,\delta y\,\delta z$,

and $\sigma_v^2=\sigma_y^2+\sigma_z^2-2r_{yz}\sigma_y\sigma_z$.

Generally, if $x_1, x_2, \dots x_n$ be the mean values of n variables,
and if $\delta x_1, \delta x_2, \dots \delta x_n$ denote deviations from these values in
a particular sample, we may write

$u=x_1+x_2+\dots+x_n$

and $\delta u=\delta x_1+\delta x_2+\dots+\delta x_n$.

Thus $\Sigma\delta u^2=\Sigma\delta x_1^2+\dots+2\Sigma(\delta x_1\delta x_2)+\dots$,

whence $\sigma_u^2=\sigma_{x_1}^2+\dots+2r_{x_1x_2}\sigma_{x_1}\sigma_{x_2}+\dots$

Important Corollary. If y and z are quite independent so that
$r_{yz}$ is zero, the p.e. of their sum and the p.e. of their difference
have the same value, namely, the square root of the sum of the
squares of the p.e.’s of y and z themselves, which

$=0.6745\sqrt{\sigma_y^2+\sigma_z^2}$ . . . (8)

This result is exceedingly important, because it can be directly
used to test whether a difference between two samples is accidental,
i.e. whether it is such as might arise through sampling, or whether
it implies a real difference between the two populations from which
the samples are selected. An example will illustrate the
procedure:—

Example (2). In a study of Minimum Rates in the Tailoring
Industry, by R. H. Tawney, a table is given (p. 114) which suggests
that ‘in the north of England women work in the tailoring trade
when they are young . . . in London and Colchester they have
to work when they are older.’ Taking some figures from that
table we find:—

District.                  Workers over     Workers at     Proportion
                           35 years old.    all ages.      over 35.

London and Essex . . .        11,718          35,316         0.332
Manchester and Leeds .         4,029          21,822         0.185

The difference between the proportions over 35 years of age

$=(0.332-0.185)=0.147$.

Let us suppose for the moment that this difference is not significant
of any real difference in conditions between the two districts, but
is merely due to random sampling. In that case the most natural
value to assign to the true proportion of women workers over 35
for the trade as a whole, as given by these figures, would be

$p=\frac{11{,}718+4{,}029}{35{,}316+21{,}822}=\frac{15{,}747}{57{,}138}=0.276$.

The S.D. for the first sample (London and Essex) would then be

$\sigma_1=\sqrt{pq/n_1}=\sqrt{0.276\times 0.724/35{,}316}$,

and for the second sample (Manchester and Leeds) would be

$\sigma_2=\sqrt{0.276\times 0.724/21{,}822}$.

Hence the p.e. for the difference between the proportions in the
two samples would be roughly

$=0.6745\sqrt{\sigma_1^2+\sigma_2^2}$, by (8),

$=\tfrac{2}{3}\sqrt{0.276\times 0.724\,(1/35{,}316+1/21{,}822)}$

$=\tfrac{2}{3}\sqrt{0.276\times 0.724/13{,}500}$

$=0.0026$.

The actual difference between the proportions, 0.147, being much
more than 3(0.0026), is certainly significant of a greater difference
between the two populations than can be explained by random
sampling alone.

Another method of attack would be to assume a real difference
between the two populations, if other considerations led us to
suspect such a difference, and to find whether such a difference could
be disguised by random sampling. In that case the proper proportion
to assume for the first sample would be 0.332, giving

$\sigma_1=\sqrt{0.332\times 0.668/35{,}316}=\sqrt{628/10^8}$,

and for the second sample the proportion would be 0.185, giving

$\sigma_2=\sqrt{0.185\times 0.815/21{,}822}=\sqrt{691/10^8}$.

Hence the p.e. for the difference between these two proportions
due to random sampling would be

$=0.6745\sqrt{\sigma_1^2+\sigma_2^2}$, by (8),

$=\tfrac{2}{3}\cdot\tfrac{1}{10^4}\sqrt{628+691}$

$=0.0024$.

The actual difference is 0.147, which certainly could not be
outbalanced by an error in the opposite direction due to random
sampling, because it is much more than three times the probable
error due to sampling.
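
Both methods of Example (2) reduce to result (8) applied to two proportions, and can be sketched as below. The function names are ours; the first pools the two samples to estimate a common p, the second keeps the two observed proportions separate. The factor 0.6745 is used in place of the rounded 2/3.

```python
import math

def pe_difference_pooled(x1, n1, x2, n2):
    """p.e. of the difference of two proportions, assuming one common p."""
    p = (x1 + x2) / (n1 + n2)
    q = 1 - p
    return 0.6745 * math.sqrt(p * q * (1 / n1 + 1 / n2))

def pe_difference_separate(x1, n1, x2, n2):
    """p.e. of the difference of two proportions, each sample keeping its own p."""
    p1, p2 = x1 / n1, x2 / n2
    return 0.6745 * math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# Workers over 35 and workers at all ages in the two districts.
london = (11_718, 35_316)
north = (4_029, 21_822)

difference = london[0] / london[1] - north[0] / north[1]
pooled = pe_difference_pooled(*london, *north)
separate = pe_difference_separate(*london, *north)

print(f"difference of proportions = {difference:.3f}")
print(f"p.e. (common p)   = {pooled:.4f}, ratio = {difference / pooled:.0f}")
print(f"p.e. (separate p) = {separate:.4f}, ratio = {difference / separate:.0f}")
```

On either assumption the difference is many times its probable error, which is the conclusion reached in the text.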

Sometimes we have to test the difference, not between two
simple proportions, but between two sample distributions. In
that case the mean of each sample may be calculated so that the
difference $(M_1-M_2)$ between the means is known; to find out
whether or not it is significant of some real difference between the
two populations from which the samples are drawn, $(M_1-M_2)$
is compared with its p.e., namely

$0.6745\sqrt{\sigma_{M_1}^2+\sigma_{M_2}^2}$

or $0.6745\sqrt{\sigma_1^2/n_1+\sigma_2^2/n_2}$ . . . (9)

where $n_1$ and $n_2$ are the numbers of observations in the two samples
respectively, and $\sigma_1$, $\sigma_2$ are the S.D.’s of the samples. Unless
$(M_1-M_2)$ is definitely greater than some two or three times this
expression we cannot be very sure that the difference between $M_1$
and $M_2$ may not have arisen merely through random sampling,
and it may quite likely not be significant * of any real difference
between the two populations as regards the organ or character
which is under consideration.

[* It should be observed that the S.D. provides a wider margin for significance
than the p.e., because a range of approximately 3 p.e. $=3\cdot\tfrac{2}{3}\sigma=2\sigma$ only. It is
quite safe therefore to attach no great significance to a difference which does
not exceed two or three times the p.e.]

Example (3).—Statistics have been collected to test whether there
is any significant difference between the eggs laid in general by
cuckoos and those laid by them in the nests of particular species
of foster parents. Results of the following kind were obtained
[see Biometrika, vol. iv., pp. 363-373, The Egg of Cuculus Canorus
(2nd Memoir), by O. H. Latter]:—

                            No. of    Mean Length    S.D.      Signi-
                            Eggs.     (mms.)         (mms.)    ficance.   Remarks.

Eggs of the Cuckoo
  race in general            1572       22.3         0.9642      ..
Eggs laid in nests of—
  Garden Warbler               91       21.9         0.7860      7.0      Significant.
  White Wagtail               115       22.4         0.7606      1.6      Not significant.
  Hedge Sparrow                58       22.6         0.8759      3.75     Probably significant.

The difference between the mean lengths of eggs laid in the nests
of garden warblers and those laid by cuckoos in general

$=22.3-21.9=0.4$ mms.

The p.e. of this difference

$=0.6745\sqrt{(0.7860)^2/91+(0.9642)^2/1572}$, by (9),
$=0.6745\sqrt{0.007380}$
$=0.058$.

Hence the significance test
$=0.4/0.058=7.0$,

and we conclude that the difference in length between the two
classes of eggs is certainly significant. Similarly the other cases
may be tested.
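
Result (9) lends itself to a small routine. The following is a sketch only, with a function name of our own, applied to the garden warbler case worked above (91 eggs, mean 21.9, S.D. 0.7860, against 1572 eggs, mean 22.3, S.D. 0.9642 for cuckoos in general).

```python
import math

def significance_of_difference(mean1, sd1, n1, mean2, sd2, n2):
    """Return the p.e. of the difference between two sample means, by (9),
    and the ratio of the observed difference to that p.e."""
    pe = 0.6745 * math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    return pe, abs(mean1 - mean2) / pe

# Eggs of the cuckoo race in general against eggs laid in garden warblers' nests.
pe, ratio = significance_of_difference(22.3, 0.9642, 1572, 21.9, 0.7860, 91)

print(f"p.e. of the difference = {pe:.3f} mms.")
print(f"difference / p.e. = {ratio:.1f}")
```

The ratio is computed here from the unrounded p.e.; rounding the p.e. to 0.058 first, as in the text, gives the tabulated 7.0. A ratio well above 3 is read as significant.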


In the example just given, to find out whether one population
differed from another, the arithmetic means have been compared;
but the mean alone will scarcely serve to establish the identity of
any population. For example, we can conceive of two distinct
races of men, both of the same mean height, but one race embracing
a number of giants and dwarfs. Of course if we agreed to define
two races as identical when they have the same mean heights, there
would be nothing more to be said, but that would certainly only
be a very rough-and-ready attempt at classification.

Taking into consideration only the character of height, a further
step in definition would be to measure the mode or most fashionable
height, and the dispersion or variability—absolute: the standard
deviation, and relative: the coefficient of variation—of the two
races. Then, after comparing heights with sufficient detail, the
attention could be turned to innumerable other characters, skull
and body measurements, physical, mental, and even moral
attributes.

Clearly the difficulty of definition and of establishment of identity
grows as we pass along the scale from physical to moral. Moreover,
other statistical constants must be requisitioned when the question
of the existence and degree of relationship between two organs or
characters is to be determined. As the S.D. and the C. of V. serve
to measure the amount of variability, so the coefficient of correlation
comes in to measure the amount of likeness or association. Further,
and especially in problems of inheritance, the coefficient of regression
must be measured. It might seem at first sight hopeless to
try and measure the correlation between two such characters as
athletic capacity and health in the same boy, or between the
truthfulness of one boy and that of his brother; but the genius of
Karl Pearson has gone some way to solve even this difficult problem
by means of a system of adjectival instead of numerical classification
[see Phil. Trans., vol. 195A, pp. 1-47, On the Correlation of
Characters not Quantitatively Measurable, and, as an exceptionally
interesting application of the method, see Pearson, On the Laws of
Inheritance in Man, ii.; On the Inheritance of the Mental and Moral
Characters in Man and its Comparison with the Inheritance of the
Physical Characters; Biometrika, vol. iii. pp. 131-190]. In short,
for a full and exact definition of a population of any kind, human
or otherwise, it is necessary to measure not only the means, but all
the more important statistical constants, modes, medians, S.D.’s,
C.’s of V., coefficients of correlation and regression, and so on, and
it is no less necessary to calculate also their probable errors if we
are to test the real significance of such differences as are observed
in these constants between two samples from the same or from
different populations.

The probable errors for the more important constants, some of 
which are only introduced later in the book, are collected together 
in Table (38) for reference. The proofs in general are a little intricate 
and would be lacking in interest to the ordinary person, who is 
satisfied to take algebraical analysis on trust so long as he
understands the nature of the results he uses, but the more mathematical
reader who is anxious to see proofs may refer for some of them to
Biometrika, vol. ii., pp. 273-281, Editorial, On the Probable Errors
of Frequency Constants, which has been freely consulted on the
subject here.

The usual notation is adopted, n being the total number of
observations in the given distribution, supposed normal in general,
σ the S.D., etc.


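Table (38). Probable Errors of Statistical Constants.


Statistical Constant.                           Probable Error (= 0.6745 × S.D.).

Any observed group frequency, $y$               $0.6745 \sqrt{y(1 - y/n)}$
The mean of a distribution of any type          $0.6745\, \sigma/\sqrt{n}$
The S.D. of a normal distribution, $\sigma$     $0.6745\, \sigma/\sqrt{2n}$
The second moment about the mean, $\mu_2$       $0.6745\, \sigma^2 \sqrt{2/n}$
The third moment about the mean, $\mu_3$        $0.6745\, \sigma^3 \sqrt{6/n}$
The fourth moment about the mean, $\mu_4$       $0.6745\, \sigma^4 \sqrt{96/n}$
The coefficient of variation, $v$               $0.6745\, \frac{v}{\sqrt{2n}} \left[1 + 2\left(\frac{v}{100}\right)^2\right]^{1/2}$
The coefficient of correlation, $r$             $0.6745\, (1 - r^2)/\sqrt{n}$
The correlation ratio, $\eta$                   $0.6745\, (1 - \eta^2)/\sqrt{n}$, nearly
X, as determined from
  $(X - \bar{X}) = r \frac{\sigma_x}{\sigma_y} (Y - \bar{Y})$,
  when Y is given                               $0.6745\, \sigma_x \sqrt{1 - r^2}$
Y, as determined from
  $(Y - \bar{Y}) = r \frac{\sigma_y}{\sigma_x} (X - \bar{X})$,
  when X is given                               $0.6745\, \sigma_y \sqrt{1 - r^2}$
Distance between mode and mean in a skew
  distribution                                  $0.6745\, \sigma \sqrt{3/(2n)}$
Skewness                                        $0.6745 \sqrt{3/(2n)}$
$\beta_2$ (which should = 3 for a normal
  distribution)                                 $0.6745 \sqrt{24/n}$
$\beta_1$ (which should = 0 for a normal
  distribution)                                 0
$\sqrt{\beta_1}$                                $0.6745 \sqrt{6/n}$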
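
The commoner entries of Table (38) can be turned into small functions. The Python sketch below is an illustration only: the function names are hypothetical, and the two check values are figures quoted elsewhere in this book (the S.D. of head length for 3000 criminals, and the petal-sepal correlation for 1000 plants of Ficaria verna).

```python
import math

PE_FACTOR = 0.6745  # probable error = 0.6745 x standard deviation


def pe_group_frequency(y, n):
    """p.e. of an observed group frequency y among n observations."""
    return PE_FACTOR * math.sqrt(y * (1 - y / n))


def pe_mean(sd, n):
    """p.e. of the mean of n observations whose S.D. is sd."""
    return PE_FACTOR * sd / math.sqrt(n)


def pe_sd(sd, n):
    """p.e. of the S.D. of a normal distribution."""
    return PE_FACTOR * sd / math.sqrt(2 * n)


def pe_coeff_variation(v, n):
    """p.e. of the coefficient of variation v, expressed in per cent."""
    return PE_FACTOR * v / math.sqrt(2 * n) * math.sqrt(1 + 2 * (v / 100) ** 2)


def pe_correlation(r, n):
    """p.e. of the coefficient of correlation r."""
    return PE_FACTOR * (1 - r ** 2) / math.sqrt(n)


# Check values quoted later in the text.
pe_head_length_sd = pe_sd(6.04593, 3000)     # the text gives 0.05265
pe_ficaria_r = pe_correlation(0.2439, 1000)  # the text gives 0.0201
print(pe_head_length_sd, pe_ficaria_r)
```

Keeping the factor 0.6745 in one named constant makes it easy to switch to plain standard errors (factor 1) if that is preferred.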


Example (4). In the example which follows are given data
necessary for testing the significance of differences in variability
as well as in mean values. They represent an attempt made to
find whether members of a particular species of crab caught in
shallow water differed with regard to certain characteristics from
those caught in comparatively deep water [see Biometrika, vol. ii.,
pp. 191 et seq., Variation in Eupagurus Prideauxi, by E. H. J.
Schuster]. Only a few of the results are recorded here, to two
decimal places; the reader will find it a valuable exercise to verify
for himself the p.e.’s given in each case.

Measurement Made.   Sex.     Locality.       Mean (mm.)     S.D. (mm.)     C. of V. per cent.

Carapace length     Male     Deep water      8.59 ± 0.05    1.67 ± 0.04    19.40 ± 0.44
Carapace length     Male     Shallow water   8.41 ± 0.04    1.49 ± 0.03    17.70 ± 0.37
Carapace length     Female   Deep water      7.54 ± 0.03    0.94 ± 0.02    12.49 ± 0.28
Carapace length     Female   Shallow water   7.12 ± 0.02    0.86 ± 0.02    12.12 ± 0.26


Sex.     Difference of Means (mm.)   Difference of S.D.’s (mm.)   Difference of C.’s of V. per cent.

Male     0.18 ± 0.07 (poss. sig.)    0.18 ± 0.05 (prob. sig.)     1.70 ± 0.58 (poss. sig.)
Female   0.42 ± 0.04 (sig.)          0.08 ± 0.03 (poss. sig.)     0.37 ± 0.37 (not sig.)


The significance or otherwise of differences between variabilities
in the case of cuckoos’ eggs (p. 161) might be tested in the same way.
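
The comparison made in this example can be sketched in a few lines of Python. The helper name is hypothetical, and the inputs are the female carapace means as read from the table (deep water 7.54 ± 0.03, shallow water 7.12 ± 0.02). The p.e. of a difference between two independent samples is taken as the root of the sum of the squared p.e.’s, and the difference is then expressed as a multiple of that p.e.

```python
import math


def difference_with_pe(a, pe_a, b, pe_b):
    """Return (a - b) and the p.e. of that difference for independent samples."""
    return a - b, math.sqrt(pe_a ** 2 + pe_b ** 2)


# Female carapace means as read from the table.
diff, pe_diff = difference_with_pe(7.54, 0.03, 7.12, 0.02)

# A difference of less than about twice its p.e. is treated in the text
# as not significant; one of many times its p.e. as significant.
ratio = abs(diff) / pe_diff
print(round(diff, 2), round(pe_diff, 2), round(ratio, 1))
```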




















CHAPTER XIV


FURTHER APPLICATIONS OF SAMPLING FORMULÆ

We have been discussing in the last chapter how to test two samples,
supposed each to contain homogeneous material, to find out whether 
they belong to the same or to different types of population, but 
the further question often arises as to whether a sample is or is not 
homogeneous. 


Example (1). To this we may obtain a partial answer by working
out the statistical constants of the sample and their p.e.’s in order
to compare them with the corresponding constants for a sample or
series of samples believed to be homogeneous and of the same
type. For example, Professor Karl Pearson has measured the
skulls of skeletons of the Naqada race, excavated in Upper Egypt
by Professor Flinders Petrie and presumed to be some 8000 years
old, and he places his results for comparison alongside those
for certain other races admittedly homogeneous [see Biometrika,
vol. ii., p. 345, Homogeneity and Heterogeneity in Collections of
Crania]:


                                  Number of        Variability (mm.).
Series.                           Observations.    Skull Length.   Skull Breadth.

Skulls:
  Ainos                                76             5.936           3.897
  Bavarians                           100             6.088           5.849
  Parisians                            77             5.942           5.214
  Naqadas                             139             5.722           4.612
  English                             136             6.085           4.976
Living heads:
  Cambridge undergraduates           1000             6.161           5.056
  English criminals                  3000             6.046           5.014
  Oraons of Chota Nagpur              100             5.916           4.397

Mean Variability                                      5.987           4.877


The S.D. of the variability of skull length calculated from this
series = 0.129 mm. and of the variability of skull breadth = 0.545 mm.,
and these supply standards for valuing the differences between the
Naqada and the mean variabilities.
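
As a rough check on these standards, the mean and S.D. of the eight tabulated variabilities can be recomputed. The sketch below is illustrative: the figures are those of the table as read here, with doubtful digits resolved so that the column means agree with the stated 5.987 and 4.877, and the S.D. is taken with divisor n (an assumption, chosen because it reproduces the stated 0.545).

```python
import statistics

# Variabilities (S.D. in mm.) of skull length and breadth for the eight series,
# in table order; the Naqada series is the fourth entry.
length_sd = [5.936, 6.088, 5.942, 5.722, 6.085, 6.161, 6.046, 5.916]
breadth_sd = [3.897, 5.849, 5.214, 4.612, 4.976, 5.056, 5.014, 4.397]

mean_len = statistics.mean(length_sd)
mean_br = statistics.mean(breadth_sd)
sd_len = statistics.pstdev(length_sd)   # population S.D., divisor n
sd_br = statistics.pstdev(breadth_sd)

# Deviation of the Naqada variabilities from the mean, in units of these S.D.'s.
naqada_len_dev = (5.722 - mean_len) / sd_len
naqada_br_dev = (4.612 - mean_br) / sd_br
print(round(mean_len, 3), round(sd_len, 3), round(naqada_len_dev, 2))
print(round(mean_br, 3), round(sd_br, 3), round(naqada_br_dev, 2))
```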

Another method of procedure is to take a random sample out of
the sample itself, assuming the latter is large enough to admit of
an adequate sub-sample, and to compare the constants of the
whole and part. When they do not differ beyond the limits allowed
by random sampling the inference is that the whole may be treated
as a homogeneous class if judged by this test alone.

Example (2). In an interesting and important memoir, On
Criminal Anthropometry and the Identification of Criminals, by W. R.
Macdonell [Biometrika, vol. i., pp. 177 et seq.], the author uses this
method to test the homogeneity of a class of 3000 criminals by
measuring also a random sample of 1306 criminals out of the 3000.
He obtained, for example,

S.D. of head length = 6.04593 ± 0.05265 mm., for the 3000 criminals;
S.D. of head length = 6.00247 ± 0.07922 mm., for the 1306 criminals.

The difference between the variabilities in the sample and
sub-sample, by result (8) on p. 158,

$$= 0.04346 \pm \sqrt{(0.05265)^2 + (0.07922)^2} = 0.04346 \pm 0.09512,$$

which is certainly not significant. If the same holds good with
regard to the means and other constants, then the whole may be
said to be homogeneous so far as this test goes.
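
The figures of this example can be reproduced from the formula for the p.e. of a S.D. in Table (38). The following Python sketch is illustrative only; the function and variable names are not from the memoir.

```python
import math

PE_FACTOR = 0.6745


def pe_sd(sd, n):
    """p.e. of the S.D. of a normal distribution of n observations."""
    return PE_FACTOR * sd / math.sqrt(2 * n)


sd_whole, n_whole = 6.04593, 3000   # all criminals
sd_sub, n_sub = 6.00247, 1306       # random sub-sample

pe_whole = pe_sd(sd_whole, n_whole)
pe_sub = pe_sd(sd_sub, n_sub)

diff = sd_whole - sd_sub
pe_diff = math.sqrt(pe_whole ** 2 + pe_sub ** 2)

# The difference is smaller than its own p.e., hence not significant.
print(round(pe_whole, 5), round(pe_sub, 5), round(diff, 5), round(pe_diff, 5))
```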

Example (3). Another example may be given from the memoir
on Variation and Correlation in Brain Weight, by Raymond Pearl
[Biometrika, vol. iv., pp. 13 et seq.]. The author wished particularly
to investigate the change of brain weight with age; on the hypothesis
that the weight of the brain reaches a maximum between
the ages of 15 and 20, remains unchanged from 20 to 50, and then
begins to decline and so continues till death, the material was
divided into a ‘ Young ’ series, ages 20 to 50, and a ‘ Total ’ series
including all between 20 and 80. The ‘ Young ’ series thus formed
a selection from the ‘ Total ’ series, but a selection based on age
and not on brain weight. If there were no correlation between
age and brain weight, this selection, based as it is on age, would,
of course, be random as regards brain weight. Now correlation
does exist between the two, but it is so slight that, within the limits
of error, the ‘ Young ’ series does form practically a random sample
of the ‘ Total ’ series, as is shown by the following figures:

Difference in Variation Constants between Young and
Total Series (written with a positive sign when the
Young Series gives the greater value).


             Male.                              Female.
             S.D.              C. of V.         S.D.                  C. of V.

Swedes       +2.851 ± 4.066    +0.122 ± 0.291   +4.[?]86 ± [?].465    [illegible]
Bavarians    -1.888 ± 3.556    -0.173 ± 0.234   -10.357 ± 3.909       0.941 ± 0.320 [sign illegible]


Thus in only one case, that of the Bavarian females, is the difference
between the variabilities, S.D. or C. of V., of the two series as
great as its probable error, and even in that case the differences,
10.357 and 0.941, are not three times as large as their respective
p.e.’s, 3.909 and 0.320. Dr. Pearl concludes from these and similar
results that ‘ the series are reasonably homogeneous in other respects
than age.’

The reader is recommended to test his knowledge of the formulæ
for probable errors by applying them to the following examples.
Dr. Alice Lee, in a note on Dr. Ludwig on Variation and Correlation
in Plants [Biometrika, vol. i., p. 316] makes use of the statistics
relating to Ficaria Verna in Example (4). Those in Example (5)
are taken from among a large number of others in the highly
interesting memoir, On the Laws of Inheritance in Man, by Professor
Karl Pearson and Dr. Alice Lee [Biometrika, vol. ii., pp. 357 et seq.],
cited once before.


Example (4). Variation and Correlation in Ficaria Verna.


                      Mean No. of       Mean No. of       Correlation between
No. of Observations.  Petals; S.D.      Sepals; S.D.      No. of Sepals and
                                                          No. of Petals.

1000 (Greiz A)        8.286; 1.3382     3.695; 0.8524     0.2439 ± 0.0201
1000 (Greiz G)        8.232; 0.9954     3.437; 0.7033     0.2480 ± 0.0200


We have here all the data necessary to find the p.e.’s of the
means, variabilities, and correlations, and we wish to know whether
the differences between the means and variabilities of the A and G
plants can be accounted for by random sampling alone.

For example, the difference between the petal means

$$= (8.286 - 8.232) \pm 0.6745 \sqrt{\frac{(1.3382)^2}{1000} + \frac{(0.9954)^2}{1000}} = 0.054 \pm 0.035.$$

Clearly this difference, being not so great as twice its p.e., is not
significant and may quite well be due to random sampling.

Again, the difference between the petal variabilities

$$= (1.3382 - 0.9954) \pm 0.6745 \sqrt{\frac{(1.3382)^2}{2000} + \frac{(0.9954)^2}{2000}} = 0.3428 \pm 0.025,$$

which is certainly much too great to be explained away by random
sampling merely.

Similarly the differences between the sepal means, between the
sepal variabilities, and between the correlations, may be tested for
significance by comparison with their p.e.’s.
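
The two petal comparisons, and the sepal comparison left to the reader, might be worked as in the following Python sketch. It is an illustration under stated assumptions: the function names are hypothetical, and the second petal S.D. is taken as 0.9954, the reading consistent with the stated difference 0.3428.

```python
import math

PE_FACTOR = 0.6745


def pe_diff_means(sd1, n1, sd2, n2):
    """p.e. of the difference between two independent sample means."""
    return PE_FACTOR * math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)


def pe_diff_sds(sd1, n1, sd2, n2):
    """p.e. of the difference between two independent sample S.D.'s."""
    return PE_FACTOR * math.sqrt(sd1 ** 2 / (2 * n1) + sd2 ** 2 / (2 * n2))


n = 1000
# Petals: (mean, S.D.) for the A and G plants.
petal_mean_diff = 8.286 - 8.232
petal_mean_pe = pe_diff_means(1.3382, n, 0.9954, n)
petal_sd_diff = 1.3382 - 0.9954
petal_sd_pe = pe_diff_sds(1.3382, n, 0.9954, n)

# Sepals: the same test applied to the sepal means.
sepal_mean_diff = 3.695 - 3.437
sepal_mean_pe = pe_diff_means(0.8524, n, 0.7033, n)

for name, d, pe in [("petal means", petal_mean_diff, petal_mean_pe),
                    ("petal S.D.'s", petal_sd_diff, petal_sd_pe),
                    ("sepal means", sepal_mean_diff, sepal_mean_pe)]:
    print(name, round(d, 4), round(pe, 4), "ratio", round(d / pe, 1))
```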


Example (5). Size and Variability of Stature in the
Two Generations.


                        Father.         Mother.         Son.            Daughter.

Mean height (in.)       67.68 ± 0.06    62.48 ± 0.05    68.65 ± 0.05    63.87 ± 0.05
S.D. (in.)               2.70 ± 0.04     2.39 ± 0.04     2.71 ± 0.04     2.61 ± 0.03
C. of V. (per cent.)     3.99 ± 0.06     3.83 ± 0.06     3.95 ± 0.06     4.09 ± 0.05


The student in this case might use one of the formulae for the
p.e.’s to find the number of fathers, mothers, sons, or daughters
observed when the p.e.’s are known, and then the remaining p.e.’s
might be verified when the numbers of observations are found.

As evidence of ‘ assortative mating,’ the tendency of like to
mate with like, the following particulars are given, based on 1000
to 1050 cases of husband and wife:

Correlation between stature of husband and stature of wife = 0.2804 ± 0.0189
Correlation between span of husband and span of wife       = 0.1989 ± 0.0204
Correlation between forearm of husband and forearm of wife = 0.1977 ± 0.0205

To measure the average intensity of inheritance, the extent of
resemblance between parents and children in any character,
coefficients of correlation are calculated such as the following:

Coefficient of Correlation

between stature of father and stature of son      = 0.514 ± 0.015
between stature of father and stature of daughter = 0.510 ± [illegible]
between stature of mother and stature of son      = 0.494 ± 0.016
between stature of mother and stature of daughter = 0.507 ± 0.016

[In verifying the p.e.’s for this case take the number of observations
to be 1024.]

One more extract may be quoted, a prediction table, giving the
probable mean stature of sons of fathers of given stature, and
so on:

Son’s probable stature      = 33.73 + 0.516 (father’s stature) ± 1.56
Daughter’s probable stature = 30.50 + 0.493 (father’s stature) ± 1.51
Son’s probable stature      = 33.65 + 0.560 (mother’s stature) ± 1.59
Daughter’s probable stature = 29.28 + 0.554 (mother’s stature) ± 1.52.

All values given in this example for the p.e.’s should be
verified.
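
The prediction for sons from fathers may be checked from the means, S.D.’s, and correlation quoted above, using the regression line and its p.e. as given in Table (38). The Python sketch below is illustrative; the function name is hypothetical, and small disagreements in the last figure arise from the two-decimal rounding of the inputs.

```python
import math

PE_FACTOR = 0.6745


def prediction_line(mean_parent, sd_parent, mean_child, sd_child, r):
    """Slope, intercept and p.e. of the child's stature predicted from the parent's."""
    slope = r * sd_child / sd_parent
    intercept = mean_child - slope * mean_parent
    pe = PE_FACTOR * sd_child * math.sqrt(1 - r ** 2)
    return slope, intercept, pe


# Father -> son, from the figures of this example.
slope, intercept, pe = prediction_line(67.68, 2.70, 68.65, 2.71, 0.514)
print(round(slope, 3), round(intercept, 2), round(pe, 2))

# p.e. of the correlation itself, taking n = 1024 as directed.
pe_r = PE_FACTOR * (1 - 0.514 ** 2) / math.sqrt(1024)
print(round(pe_r, 4))
```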

Before we consider further applications of these principles to 
questions of a somewhat different kind, let us imagine a very 
simple though artificial illustration. Suppose we have 999 sheep, 
each one ticketed, the numbers on the tickets running from 1 to 
999. Also suppose 666 of those sheep are white and 333 are black, 
so that, if we pick out any one at random, the chance of it being
black is 333/999 or 1/3. Let us call picking a black sheep a ‘success,’
then p = 1/3, q = 2/3.

We proceed now to select 99 sheep in succession at random
from the flock with the understanding that each sheep is returned
into the flock before the next is picked out. This ensures that
the chance of a success at each selection remains equal to 1/3 and,
of course, there is nothing to prevent the same sheep being picked
more than once. The selection might practically be made by
placing in a box 999 tickets, numbered from 1 to 999, one to
correspond to each sheep, then picking out 99 of them in succession,
being careful to replace each and to shake up the box before picking
out the next; if there were absolutely no difference between the
tickets, such as would cause one to be picked more easily than
another, the selection made in this way would be random in the
sense required, and the tickets so chosen would determine which
sheep were to be taken and which left.

The proportion of black sheep to be expected in such a random
selection of 99 is 1/3, but, if we only perform the experiment once,
it is quite likely that the proportion we actually get will differ from
1/3 by an amount

$$= 0.6745 \sqrt{pq/n} = 0.6745 \sqrt{\tfrac{1}{3} \cdot \tfrac{2}{3} \cdot \tfrac{1}{99}} = 1/31, \text{ about,}$$

while it is unlikely that the proportion will differ from 1/3 by much
more than 3/31, or 1/10.

Conversely (and it is really the converse which is useful in practice),
if we do not know the proportion of black sheep in the whole
flock, we may get a fair estimate of it by taking a random sample
of 99 sheep (any other number will serve the purpose, but the
larger the better for accuracy), and if we find that in this sample
there are 33 black sheep, i.e. p = 33/99 = 1/3, it will appear that
the value of p for the whole flock is 1/3, subject to a probable error
$0.6745\sqrt{pq/n}$ in excess or defect, i.e. the true proportion for the
whole flock may quite likely differ from 1/3 by as much as 1/31,
but it is unlikely to differ by much more than 1/10. It should be
noticed that the calculation of the probable error in this converse
case is based upon the value of p given by the sample taken, for
that is the only value of which we have knowledge.
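
The sheep illustration lends itself to a small simulation. The Python sketch below is an illustration only: the function names, the number of repetitions, and the seed are arbitrary assumptions. It repeats the drawing of 99 sheep with replacement many times, and counts how often the observed proportion falls within one p.e., and beyond three p.e.’s, of 1/3. Because the count of black sheep is a whole number, the first fraction comes out only roughly one half.

```python
import math
import random


def pe_of_proportion(p, n):
    """Probable error of a proportion p observed in n independent trials."""
    return 0.6745 * math.sqrt(p * (1 - p) / n)


def simulate_proportions(flock, n_draws, n_trials, seed=0):
    """Draw n_draws sheep with replacement, n_trials times; return the proportions."""
    rng = random.Random(seed)
    return [sum(rng.choice(flock) for _ in range(n_draws)) / n_draws
            for _ in range(n_trials)]


flock = [1] * 333 + [0] * 666   # 1 = black sheep (a 'success'), 0 = white
pe = pe_of_proportion(1 / 3, 99)
props = simulate_proportions(flock, n_draws=99, n_trials=2000)

within_pe = sum(abs(x - 1 / 3) <= pe for x in props) / len(props)
beyond_3pe = sum(abs(x - 1 / 3) > 3 * pe for x in props) / len(props)
print(round(pe, 4), within_pe, beyond_3pe)
```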

Too much stress can scarcely be laid on the fact that the samples
chosen must be absolutely unbiassed, otherwise the use of the
formulæ $np$ and $\sqrt{npq}$, or the corresponding proportional formulæ,
cannot be justified: each sheep in our illustration must have the
same chance of being picked, and no one selection is to have any
influence on another. The failure to appreciate this essential
point has led to no little waste of time and effort in the collection
of valueless statistics.

The method of sampling has been employed in a way at once 
interesting and useful by Dr. A. L. Bowley, and, as some of this 
work has barely received the attention it deserves, it may be well 
to explain two of his experiments in some detail. 

The first was of interest because its results could be tested by 
an examination of the original record from which the sample was 
taken. The details concerning it are abstracted from the Journal
of the Royal Statistical Society, September 1906.

Example (6). Bowley sampled the dividends paid by 3878
companies as quoted in the Investors’ Record. His sample
consisted of 400 of these companies, i.e. about 10 per cent., selected in
a purely arbitrary fashion thus: the investigator took a Nautical
Almanac and noted down the last digits of one of the tables,
recording them in groups of four, but if any particular group gave a
number bigger than 3878 he rejected it. In this way each of the
numbers between 1 and 3878 had an equal chance of selection (for
numbers under four figures would appear like 0327, 0042, 0009,
which would be taken to represent 327, 42, 9 respectively), and the
selection of one had no influence on that of any other. The
companies in the Investors’ Record were numbered consecutively, and
the dividends corresponding to the 400 arbitrary numbers obtained
formed the sample with which Bowley worked.
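
The selection procedure just described amounts to reading digits in groups of four and rejecting any group outside the range 1 to 3878. A minimal Python sketch follows; it is an illustration only, in which pseudo-random digits stand in for the last digits of the Nautical Almanac table, and the function name and seed are hypothetical.

```python
import random


def select_by_digit_groups(digits, group_size, upper, sample_size):
    """Read digits in groups; keep each number in 1..upper (repeats allowed)."""
    chosen = []
    for i in range(0, len(digits) - group_size + 1, group_size):
        number = int("".join(str(d) for d in digits[i:i + group_size]))
        if 1 <= number <= upper:          # e.g. 0327 counts as 327; 0000 is rejected
            chosen.append(number)
            if len(chosen) == sample_size:
                break
    return chosen


rng = random.Random(1906)                 # arbitrary seed, for repeatability
digits = [rng.randrange(10) for _ in range(4 * 5000)]
sample = select_by_digit_groups(digits, group_size=4, upper=3878, sample_size=400)
print(len(sample), sample[:5])
```

Since only about 39 groups in every 100 are accepted, the digit supply is made generously larger than the 400 groups finally wanted.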

After making some interesting deductions with regard to the
average for the whole distribution, to which we shall return
presently, he proceeded to forecast the grouping of the original
companies as to their dividends by setting out the grouping discovered
in the sample 400, as follows, using the standard deviation in place
of the probable error as the error due to random sampling:

Table (39). Distribution of Dividends paid by a
Sample of 400 Companies.


(1)                         (2)           (3)                         (4)
Dividend.                   Sample of     Percentage of Sample        Percentage of
                            400           Companies in each Class.    all Companies
                            Companies.                                in each Class.

Nil                           28           7   with S.D. = 1.27          6
£1 to £2, 19s. 9d.             6           1½                            1.5
£3 to £3, 9s. 9d.             37           9¼  with S.D. = 1.46          8.4
£3, 10s. to £3, 19s. 9d.      71          17¾  with S.D. = 1.90         18.8
£4 to £4, 9s. 9d.             64          16   with S.D. = 1.83         17.3
£4, 10s. to £4, 19s. 9d.      53          13¼  with S.D. = 1.68         13.8
£5 to £5, 19s. 9d.            60          15   with S.D. = 1.78         17.7
£6 to £7, 19s. 9d.            48          12   with S.D. = 1.63         10.8
£8 to £10, 19s. 9d.           29           7¼  with S.D. = 1.29          3.8
Above £11                      4           1                             1.9


In col. (3) the S.D. for each group was calculated as follows:
for the first group: out of 400 possible events we have 28 successful
events, meaning by ‘ successful ’ here ‘ a company paying no
dividend,’ thus

p = 28/400, q = 372/400.

Hence the S.D. of the frequency in the first group

$$= \sqrt{28\left(1 - \tfrac{28}{400}\right)} = \sqrt{28 \times 372}/20 = 5.1.$$

Since this is for a sample of 400, the S.D. of the percentage* frequency
in the first group

$$= \tfrac{1}{4}(5.1) = 1.27.$$

The other S.D.’s are calculated in the same way, but when the
number in a class is very small the forecast can scarcely be relied
upon and consequently the S.D. is not inserted.
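
The calculation for col. (3) can be written as a short function and applied to any class of Table (39). The Python sketch below is illustrative, with a hypothetical function name; it uses three of the classes and also expresses the gap between the sample percentage and the percentage for all companies in units of the S.D.

```python
import math


def sd_of_percentage(count, n):
    """S.D. of the percentage frequency of a class holding `count` of n sample items."""
    sd_count = math.sqrt(count * (1 - count / n))
    return 100 * sd_count / n


N_SAMPLE = 400
# (dividend class, number in sample, percentage among all 3878 companies)
classes = [
    ("Nil", 28, 6.0),
    ("£4 to £4, 9s. 9d.", 64, 17.3),
    ("£8 to £10, 19s. 9d.", 29, 3.8),
]

results = []
for label, count, pct_all in classes:
    pct_sample = 100 * count / N_SAMPLE
    sd = sd_of_percentage(count, N_SAMPLE)
    results.append((label, pct_sample, sd, (pct_sample - pct_all) / sd))
    print(label, pct_sample, round(sd, 2), round((pct_sample - pct_all) / sd, 2))
```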

It will be noted, by comparing with the numbers in col. (4),
showing the corresponding percentages for all the 3878 companies,
that every forecast was remarkably good except one, class £8 to
£10, 19s. 9d., where the error approaches three times the S.D., and
the exception will serve as a warning that, in working with samples,
the unexpected sometimes happens. Professor Edgeworth, in his
Presidential Address to the Royal Statistical Society (1912), points
out that the method appears to be a permanent institution in 
the Statistical Bureau at Christiania, where it has given very good 
results. These can be checked or ‘ controlled ’ for safety if complete 
statistics are obtainable under some heads. He fairly sums up the 
utility of sampling when he says that ‘ we may obtain from samples 
a general outline of the facts—often sufficient for the initiation of 
a project like that of insurance—rather than the features in detail.’ 

Bowley also divided up his 400 random samples into 40 groups
of 10 companies each, and calculated the average for each group.
The S.D. for these 40 averages was found in the usual way, giving
0.775. But since this was the S.D. for averages of 10, we conclude
that

(the S.D. for the distribution of the 400 companies)/$\sqrt{10}$ = 0.775,

i.e. the S.D. for the distribution of the 400 companies = $0.775\sqrt{10}$.

Hence, applying the same principle again,

the S.D. of the average of the 400 sample companies

$$= 0.775\sqrt{10}/\sqrt{400} = £0.122.$$

[* It would not be correct to take $\sqrt{7(1 - \tfrac{7}{100})}$ as the S.D. of the percentage
frequency in the first group; this value would be double the true value, namely,
$\tfrac{1}{4}\sqrt{28(1 - \tfrac{28}{400})} = \tfrac{1}{2}\sqrt{7(1 - \tfrac{7}{100})}$, because the accuracy is increased by increasing
the number of events in a sample, and the sample here is really 400 and not 100.]

Now the average of the 400 samples turned out to be £4.7435.
Hence it was judged that, if this was a fair selection (and the random
method adopted was such as to make it fair in all reasonable
likelihood), the average for the 3878 companies should certainly lie
between

$$£[4.7435 \pm 3(0.122)].$$

The true average was found by actual calculation to be £4.779,
well within the above limits, although the original items varied from
nil to £103, being grouped according to the nature of the security
(Government, Railways, Mines, etc., etc.), and the averages and
S.D.’s on successive pages differed materially. This aggregation,
Bowley remarks, is very similar to that found in wages in different
occupations and localities, and in many other practical examples.
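
The two steps of this argument, from the S.D. of group averages back to the S.D. of single items and forward again to the S.D. of the grand average, are easily checked. The Python sketch below is an illustration with a hypothetical function name.

```python
import math


def sd_of_grand_mean(sd_group_means, group_size, n_total):
    """S.D. of the mean of n_total items, given the S.D. of means of groups of group_size."""
    sd_single = sd_group_means * math.sqrt(group_size)   # S.D. of individual dividends
    return sd_single / math.sqrt(n_total)


sd_mean = sd_of_grand_mean(0.775, group_size=10, n_total=400)

sample_mean = 4.7435
lower = sample_mean - 3 * sd_mean
upper = sample_mean + 3 * sd_mean
true_mean = 4.779   # average for all 3878 companies, as stated in the text

print(round(sd_mean, 3), round(lower, 3), round(upper, 3), lower < true_mean < upper)
```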

The value of the second experiment due to Dr. Bowley lies in the 
suggestion that similar means can be applied with good results to 
the investigation of many social phenomena. 

If out of a large group a comparatively small sample of statistics 
is collected in the purely random manner already described, we are 
able by such means to estimate what is the average, and even to 
obtain limits between which the average will almost certainly lie, 
in the large group based upon values found for the average and 
S.D. in the small sample. 

Example (7). With the collaboration of Mr. Burnett-Hurst and
a number of other workers, Dr. Bowley conducted an inquiry into
the conditions of working-class households in four representative
towns (Northampton, Warrington, Stanley, and Reading), the
results of which are published by Messrs. Bell and Sons under the
title of Livelihood and Poverty. They are similar in character to
those obtained by Rowntree in his study of conditions in York,
but what is peculiar to Bowley’s inquiry is that only a sample,
about 1 in 20, of the working-class houses in each town was
examined, and the conditions in the towns as a whole were deduced
from these samples.

We are not concerned here with the actual facts disclosed by the
investigation, striking as they are, but with the explanation of the
sampling method adopted, and as to that it may be remarked that
the foundation on which it rests is precisely the same as that which
underlay the example of the 999 black and white sheep. The
main point to notice here again is that Bowley was careful to select
his samples in unbiassed fashion as follows: ‘ for each town a list
of all houses . . . was obtained, and without reference to anything
except the accidental order (alphabetical by streets or otherwise)
in the list, one entry in twenty was ticked. The buildings so
marked, other than shops, institutions, factories, etc., formed the
sample.’ It will be evident that this method of choice is not quite
on the same level of randomness as that followed, for example, in 
drawing cards from a well-shuffled pack, each card to be replaced 
and the pack reshuffled before the next is drawn; but, for that 
very reason, the results of the experiment are all the more likely 
to be well within the limits of error provided by the formulae of 
the ideal case. The deliberate selection of every twentieth house 
in each street is likely, that is to say, to give a more representative 
picture of the town as a whole than would be obtained by selecting 
the same number of houses in a purely random fashion which might 
by chance give too much emphasis to some street or district. 

A practical test of the goodness of the sample was possible by 
comparing the results in a few instances with information available 
from other sources. In order to make the method of working 
quite clear, let the guiding principle first be recalled :— 

‘ If, in a random sample of n items, the proportion of successes
is p, then the proportion of successes in the universe from which the
sample is selected will not be likely to fall outside the limits

$$p \pm 3(0.6745)\sqrt{pq/n},$$

and, if that universe contains altogether N items, the number of
successes will not be likely to fall outside the limits

$$Np \pm 3(0.6745)\,N\sqrt{pq/n}.$$ ’

In Reading the total number of all inhabited houses in the
borough was 18,000 at the time of the inquiry, i.e. N = 18,000.
The total number of houses visited was 840, i.e. n = 840. If we
call a house assessed at £8 or less a ‘ success,’ the number of such
houses found in the sample was 206.

Thus p = 206/840, q = 634/840,

and the number of houses rented at £8 or less in the whole borough
should be

Np with a p.e. = $0.6745\,N\sqrt{pq/n}$,
i.e. 4414 ± 180.

The actual number of houses so rented was known from other sources
to be 4380, well within the limits forecasted.
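
The Reading forecast follows at once from the guiding principle. A minimal Python sketch, with a hypothetical function name, is given here as an illustration.

```python
import math


def estimate_total(successes, n, N, factor=0.6745):
    """Estimated number of successes among N items, with its probable error,
    from `successes` found in a random sample of n."""
    p = successes / n
    q = 1 - p
    return N * p, factor * N * math.sqrt(p * q / n)


estimate, pe = estimate_total(successes=206, n=840, N=18000)
actual = 4380   # number known from other sources

print(round(estimate), round(pe), round(abs(estimate - actual) / pe, 2))
```

The forecast differs from the known figure by only a fraction of its probable error, far inside the limit of three p.e.’s laid down above.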

The value used for p in the above is that given by the sample,
but when we know the actual number of successes in the universe
as a whole, as in this case we do, we might use the true value of
p, i.e. the value for the universe in place of that for the sample.
The argument might also be put in another way without affecting
the principle employed, thus:

The number of houses rented at £8 or less in the whole borough
was 4380.

But the proportion of houses sampled in the whole borough was
840/18000, i.e. 1/21.43.

Hence the number of houses at the above rental to be expected
in the sample = 4380/21.43 = 204.

The number actually found in the sample was 206, with a probable
error

$$= 0.6745\sqrt{npq} = 0.6745\sqrt{840 \times \tfrac{4380}{18000} \times \tfrac{13620}{18000}} = 8, \text{ approximately.}$$

Again, the number of persons engaged in a certain occupation at
Reading was known to be 761 in the borough as a whole. Hence
the number of persons so engaged to be expected in the sample
was 761/21.43, i.e. 35.

The number actually found in the sample was 29 with a probable
error

$$= 0.6745\sqrt{npq} = 0.6745\sqrt{840 \times \tfrac{761}{18000} \times \tfrac{17239}{18000}} = 4, \text{ approximately.}$$

Further examples of the method are here given, in each of which 
the total number of events is small so that the number in each 
sample is also small, and since, as we have seen, the accuracy or 
precision of the proportion of successes discovered in any sample 
varies directly as the square root of the number of events the sample 
contains, the results cannot be expected to be so good when this 
number is small. 


Example (8). 514 candidates sat a certain examination paper;
their marks ranged from 3 to 64. The candidates were numbered
consecutively from 1 to 514, and a random sample of 90 (17½ per
cent.) was selected from among them by writing down the 90
numbers formed by the digits in the seventh decimal place, taken
in groups of three, in the logs of the numbers 10104, 10204,
10304, . . . , as given in Chambers’s Tables, neglecting all numbers
greater than 514 and calling such numbers as 005, 037, etc., 5,
37, etc. In this way each of the numbers between 1 and 514 stood
an equal chance of inclusion.

The distribution of candidates in the sample is compared with
that for all 514 together in the following table:


                          Percentage of All          Percentage of Candidates in
No. of Marks Obtained.    Candidates who obtained    Sample who obtained
                          these Marks.               these Marks (with p.e.).

Less than 15                     8                        8 ± 1.9
15 but less than 25             19                       17 ± 2.6
25 but less than 30             16                       18 ± 2.7
30 but less than 35             18                       13 ± 2.4
35 but less than 40             15                       17 ± 2.6
40 but less than 50             19                       18 ± 2.7
50 and over                      7                       10 ± 2.1


The reader might verify the p.e.’s given in the last column:
e.g. proportion in the sample obtaining less than 15 marks = 7/90;
therefore p = 7/90, q = 83/90.

Hence the S.D. for this group

$$= \sqrt{7\left(1 - \tfrac{7}{90}\right)} = 2.54,$$

and the S.D. for the percentage

$$= \tfrac{100}{90} \times 2.54 = 2.8.$$

Thus the p.e. for the percentage

$$= \tfrac{2}{3}\sigma = 1.9, \text{ approximately.}$$
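
The whole of the last column can be verified in the same way. The Python sketch below is an illustration under two assumptions, both labelled in the code: the class counts in the sample of 90 are inferred from the rounded percentages (only the first, 7, is stated in the text), and the p.e. is taken as two-thirds of the S.D., the round factor which the tabulated values appear to follow in place of 0.6745.

```python
import math


def pe_of_percentage(count, n, factor=2 / 3):
    """p.e. of the percentage frequency of a class with `count` of n sample items.
    factor = 2/3 is an assumed round stand-in for 0.6745."""
    sd_count = math.sqrt(count * (1 - count / n))
    return factor * 100 * sd_count / n


N_SAMPLE = 90
# Counts inferred from the rounded percentages 8, 17, 18, 13, 17, 18, 10 (assumption).
counts = [7, 15, 16, 12, 15, 16, 9]

pes = [pe_of_percentage(c, N_SAMPLE) for c in counts]
for c, pe in zip(counts, pes):
    print(round(100 * c / N_SAMPLE), "+/-", round(pe, 1))
```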

Example (9) deals in a similar way with the data concerning
infectious diseases in 241 towns in England and Wales previously
recorded on p. 62.

A sample of 60 towns, i.e. about 25 per cent., was chosen in a
random fashion as in the last example, and the sample distribution
is compared below with that of the 241 towns as a whole.

The verification of the probable errors in this and the next case
is left to the reader.


Case Rate per 1000      Actual No. of       No. as suggested by
of the Population.      Towns so rated.     the Sample (with p.e.).

1 and under 5                85                  92 ± 10
5 and under 9                86                  96 ± 10
9 and under 13               42                  28 ± 7
13 and over                  28                  24 ± 6

Example (10) is concerned with the annual output per head in
142 different types of employment as given in 1907 by the Census
of Production [data from Sixteenth Abstract of Labour Statistics of
the United Kingdom, Cd. 7131]. The distribution suggested by a
random sample of 50 different occupations is compared with that of
the complete list of 142 occupations.


                        No. of Occupations    No. in Complete       Actual No.
Output per head.        in Sample with        List as deduced       found in
                        this Output.          from Sample           Complete List.
                                              (with p.e.).

Under £60                     4                 11 ± 3.6                12
£60 and under £80            16                 45 ± 6.2                42
£80 and under £100            6                 17 ± 4.3                25
£100 and under £120          10                 28 ± 5.3                20
£120 and under £190           8                 23 ± 4.9                27
£190 and over                 6                 17 ± 4.3                16


The S.D. in each of the last three examples has been calculated
by using the value for p given by the sample, which is the value
one must fall back upon in practice when the true p for the whole
distribution is unknown. In any case where we are able to test
our sample by comparison with the whole distribution, however,
it is possible to use the true value of p, e.g. in Example (10)
output £100-120, p = 20/142 as opposed to 10/50.
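
The deduced column of Example (10) and its probable errors may be recomputed as follows. The Python sketch is illustrative: the function name is hypothetical, and the factor 2/3 is an assumption, adopted because it reproduces the tabulated p.e.’s more closely than 0.6745 does. The last lines show the alternative mentioned above, using the true p for the £100-120 class.

```python
import math

PE_FACTOR = 2 / 3   # assumed round stand-in for 0.6745


def scaled_estimate(count, n, N, factor=PE_FACTOR):
    """Number expected among N items, with p.e., from `count` found in a sample of n."""
    p = count / n
    return N * p, factor * N * math.sqrt(p * (1 - p) / n)


N_ALL, N_SAMPLE = 142, 50
sample_counts = [4, 16, 6, 10, 8, 6]      # occupations in the sample, by class
actual_counts = [12, 42, 25, 20, 27, 16]  # occupations in the complete list

results = [scaled_estimate(c, N_SAMPLE, N_ALL) for c in sample_counts]
for (est, pe), actual in zip(results, actual_counts):
    print(round(est), "+/-", round(pe, 1), "actual", actual,
          "deviation in p.e.'s:", round(abs(est - actual) / pe, 1))

# Using the true p = 20/142 for the £100-120 class instead of 10/50.
p_true = 20 / 142
pe_true = PE_FACTOR * N_ALL * math.sqrt(p_true * (1 - p_true) / N_SAMPLE)
print(round(pe_true, 1))
```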










CHAPTER XV 


CURVE FITTING: PEARSON’S GENERALIZED

PROBABILITY CURVE

It may be recalled that in the introductory chapter an outline was
given of the manner in which the theory of Statistics might be
conceived to develop. It was shown how the desire for simplification
and the need for compression leads to the division of a large
mass of figures dealing with any given matter into groups; indeed,
it may well be that the statistics have been so arranged at the
source in the act of collecting: e.g. we may have to deal with
so many males of height 54 in. and less than 55 in., so many of
height 55 in. and less than 56 in., so many of height 56 in. and less
than 57 in., and so on. Here corresponding to each given height,
which we may label x, or each range of height, such as $x_1$ to $x_2$,
we have a certain frequency of males of that height or range,
which frequency we may label y, and hence a frequency table can
be formed showing the variation of y with x. Further we have
seen how such pairs of corresponding values of x and y can be
plotted so as to picture the complete observed frequency distribution
to the eye.

Now the representation thus made, though helpful up to a point, 
is not entirely satisfactory. Whether we simply join up successive 
points (x, y), or set up rectangles of varj-ing height y on bases 
spanning the successive ranges of x, or erect ordinates (y’s) at the 
mid-points of these bases, joining the summits in the manner 
previously described, the connection so established between each 
observation and the next is too superficial, depending merely on 
the fact of casual neighbourship, and may sometimes give a false 
impression of frequency and changes in frequency in the population 
of which the observations are but a sample. And this is necessarily
so if we confine ourselves strictly to the data observed.

One difficulty which has to be faced is that only within certain 
broad limits can we trust our observations to give us information 
which is truly representative of the population in which we are
interested. We seldom if ever deal with the whole population:
in fact it may be so large that it is impracticable even to reckon it;
instead we make a random or unbiassed selection of a smaller but
adequate number of individuals belonging to the population, and
classify them according to the size or nature of the character which
concerns us. But, granted that our sample is adequate in size
and unbiassed, the numbers obtained in the different groups of the 
frequency distribution will still be subject to the errors of random 
sampling, and it is only after these errors have been calculated that 
we can lay down the probable limits within which our sample may 
be regarded as really representative of the population as a whole. 

Another difficulty arises owing to the fact that our observations
in general do not cover the whole field of values of the variables x
and y; we may quite likely want to know the percentage frequency,
y, of individuals with a character (height or whatever it may be) x
which does not chance to be any one of the x's observed, if the
observations are only recorded according to discrete (separately
distinct, like 5 ft., 6 ft., 7 ft.) values of x; on the other hand, if
the observations have been classed in groups, the frequency in
which we are interested may refer to an x which does not coincide
with the centre of any group or which is even outside the range
altogether. We have therefore further to inquire whether such 
information can be deduced in any way from the statistics collected. 

Now it so happens that both these difficulties disappear if we
can only attain the ideal already outlined in discussing graphs,
and find a suitable curve to 'fit' the statistics observed. Such a
curve would not necessarily pass through all or any of the points
(x, y) representing the observations, for these, as we have remarked,
are subject to errors of random sampling and the observed frequency
y of any x may be greater or less than the corresponding y in the
population at large to which the curve is presumed to approximate.
The curve in short must remove the roughnesses which are inseparable
from ordinary observation. Moreover, given any x, not
merely one of the x's observed, it must be possible to read off from
it the corresponding y, the frequency appropriate to that x.

It is not always accurate enough for our purpose to draw a curve
by eye, passing as evenly as possible through the middle of the
points observed in the manner conceived in an earlier chapter. It
is necessary in some way to find an algebraical formula, possibly
even a trigonometrical, exponential, or more complex expression,
which will give the y corresponding to any x desired. This formula
or equation must depend upon the statistics collected: i.e. the
constants involved in it must be directly and fairly easily computed
from the y's observed, and the results of all the observations should 
enter into the equations which determine the constants in order to 
make use of the full information at our disposal. In addition, the 
method of determining the equation and its constants should be as 
general as possible, so relieving us of the trouble of discovering a
new method owing to the failure of the original one at nearly every 
trial. Finally, the equation should not be so intricate as to make 
the labour of calculating y for any given x too heavy to be attempted 
with the ordinary equipment at the statistician’s disposal. Once 
such an equation is found it is a fairly straightforward proceeding 
to trace the curve for which it stands, and it will remain afterwards 
to test the goodness of fit in some more refined way than by seeing
how closely it passes through the observed points by eye. 

When we come to review the shapes of the frequency polygons 
or histograms most commonly met, we find that the majority
of them start from low frequency, rise to a maximum as x, the
character observed, increases, then fall again towards zero very
likely at a different rate. In fact the
statistics suggest a shape something like that shown in fig. (27)
for the corresponding frequency curve, though we cannot be sure 
that it would coincide with the axis at either extremity. [Cases 
do occur where the curve has two or even more humps (maxima), 
but we purposely restrict ourselves to the simpler and more frequent 
type described.] 

Now the simplest shape to deal with from the algebraical point 
of view would certainly be symmetrical in character, corresponding 
to statistics which rise and fall at the same rate, though this would 
not necessarily be the most common shape among the records of
actual life. In order to simplify our problem, therefore, we might
start by making up for ourselves an ideally simple set of statistics
which are perfectly symmetrical, and see whether we can discover
a process for fitting a curve in a case of that kind. If this prove
successful it might be possible afterwards to adapt the same process
to an unsymmetrical or 'skew' set of statistics made up in a similar
way. Then finally we should inquire whether actual observations 
conform to any of the types of curve discovered, and, if so, how 
they can be fitted together. 

Now in manufacturing our statistics we must keep before us the
object at which we are aiming. Given the statistics, what we
want is a formula, algebraical or of some other kind, to fit them. 
This raises the possibility of choosing the statistics themselves in 
some algebraical form, and such a form is at hand in the binomial 
expansion, which is, in fact, one of the first examples of a general 
symmetrical expression one meets. Thus 

\[
\begin{aligned}
(a+b)^1 &= a + b \\
(a+b)^2 &= a^2 + 2ab + b^2 \\
(a+b)^3 &= a^3 + 3a^2b + 3ab^2 + b^3 \\
(a+b)^4 &= a^4 + 4a^3b + 6a^2b^2 + 4ab^3 + b^4 \\
(a+b)^5 &= a^5 + 5a^4b + 10a^3b^2 + 10a^2b^3 + 5ab^4 + b^5 \\
&\ \ \vdots \\
(a+b)^n &= a^n + na^{n-1}b + \frac{n(n-1)}{1\cdot 2}a^{n-2}b^2 + \dots
\end{aligned}
\]

Clearly all these expressions become perfectly symmetrical if we
put a = b, for they read the same whether we run from left to right
or from right to left. 

We have already seen what an important part the binomial 
expansion plays in the early stages of the theory of probability:
$(\tfrac12+\tfrac12)^{10}$, when expanded, tells us at once the proportion of times
on the average we may expect 10 heads, 9 heads and 1 tail, 8 heads
and 2 tails, and so on, when we toss an evenly-balanced coin ten
times in succession; or again, if p is the probability that a certain
event will happen, and q the probability that it will fail to happen
at one trial, then the probabilities that it will happen n times,
(n-1) times, (n-2) times, ... in n trials are given by the successive
terms in the expansion of $(p+q)^n$. However, we make no
assumption for the moment as to the values of a and b, except
that in the symmetrical case with which we begin they are equal,
and we have as the successive terms of $(a+a)^n$:

\[
a^n,\quad na^n,\quad \frac{n(n-1)}{1\cdot 2}a^n,\quad \dots,\quad \frac{n(n-1)}{1\cdot 2}a^n,\quad na^n,\quad a^n.
\]


Let us suppose that our observed statistics take the above form
so that these terms may be plotted as a succession of ordinates,
$y_1, y_2, y_3, \dots, y_{n+1}$, associated with abscissae $x_1, x_2, x_3, \dots, x_{n+1}$,
at equal distances apart measured, say, by c; for convenience we
may place the origin as in fig. (28), so that

\[ x_1 = c,\quad x_2 = 2c,\quad x_3 = 3c,\quad \dots,\quad x_{n+1} = (n+1)c, \]

and we can then form a frequency polygon, where

\[ x_r = rc, \qquad y_r = a^n\,\frac{n(n-1)\cdots(n-r+2)}{1\cdot 2\cdot 3\cdots(r-1)} \]


are typical values of a pair of the variables x and y, each such 
pair defining a vertex of the polygon. 

Now in this case, since the statistics have been artificially built 
up by ourselves and are not in reality a random selection, they are
not subject to errors of sampling and the fitting curve should,
therefore, pass through the summits of all the y's, or, perhaps
better, touch each of the lines joining adjacent summits. The
curve only differs from the neighbouring outline of the polygon in
that the latter is discontinuous, it alters its direction relative to the 
axis of X by jerks at equal intervals c measured along OX, whereas 
the former must rise gradually and continuously and then fall in 
the same way. This is one sense in which we mean that the fitting 
curve removes the roughness of the observation statistics—it gets 


rid of jerks besides filling gaps in the observations. 

It will be clear that as n increases and c diminishes (and this is 
what we aim at in collecting statistics, though it has not been assumed 
in what immediately follows) the discontinuity in the polygon 
becomes less and less pronounced and the outline of the figure 



approximates more and more closely to the 
curve. Moreover this approximation gains in 
intensity if we make the slope of the curve at 
each appropriate point the same as the slope 
obtained by joining up the summits of adjacent 
ordinates of the polygon. 


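Now the expression

\[ (y_{r+1}-y_r)/c \]

is the measure of the gradient from the rth ordinate to the (r+1)th,
and

\[
\begin{aligned}
\frac{y_{r+1}-y_r}{c}
&= \frac{a^n}{c}\left[\frac{n(n-1)\cdots(n-r+1)}{1\cdot 2\cdots r} - \frac{n(n-1)\cdots(n-r+2)}{1\cdot 2\cdots(r-1)}\right] \\
&= \frac{a^n}{c}\cdot\frac{n(n-1)\cdots(n-r+2)}{1\cdot 2\cdots(r-1)}\left[\frac{n-r+1}{r}-1\right] \\
&= y_r\,\frac{n-2r+1}{rc}.
\end{aligned}
\]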

If this be also taken as the gradient of the tangent to the curve at
the point midway between $(x_r, y_r)$ and $(x_{r+1}, y_{r+1})$, calling this point
(x, y) we have, since, in the notation of the differential calculus,
dy/dx is the measure of the gradient of the curve at this point,

\[ \frac{dy}{dx} = \frac{y_{r+1}-y_r}{c} = y_r\,\frac{n-2r+1}{rc}. \]


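And

\[ x = \tfrac12(x_r+x_{r+1}) = \tfrac12(2r+1)c, \]

\[
y = \tfrac12(y_r+y_{r+1})
  = \frac{a^n}{2}\cdot\frac{n(n-1)\cdots(n-r+2)}{1\cdot 2\cdots(r-1)}\left[1+\frac{n-r+1}{r}\right]
  = \frac{y_r}{2r}(n+1).
\]

Hence

\[ y_r = \frac{2ry}{n+1}. \]

Thus

\[
\frac{dy}{dx} = \frac{2ry}{n+1}\cdot\frac{n-2r+1}{rc}
= \frac{2y}{(n+1)c}\bigl[(n+2)-(2r+1)\bigr]
= \frac{2y}{(n+1)c}\left(n+2-\frac{2x}{c}\right).
\]
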

But if we had started with any other two adjacent ordinates 
instead of $y_r$ and $y_{r+1}$ we should have been led to exactly the same
relation connecting the corresponding x and y of the required 
curve, for r, which serves to particularize the ordinates, does not 
appear in the relation at all—their individuality has been eliminated. 
The above equation may thus, if we please, be taken as holding 
good for, and therefore defining, all points (x, y) of the fitting curve:
it is, in short, the differential equation of that curve. 

The equation may be slightly simplified by transferring the
origin to the point $\left(\tfrac12(n+2)c,\ 0\right)$, evidently the point O' in fig. (28)
corresponding to the maximum ordinate of the polygon or curve.
Algebraically, this merely means that for x we must write

\[ x + \tfrac12(n+2)c \]

in the equation, which then becomes

\[ \frac{dy}{dx} = \frac{2y}{(n+1)c}\left(-\frac{2x}{c}\right) = -\frac{4xy}{(n+1)c^2}. \]

We may pass to the equation proper of the curve by integration.
Thus, separating the variables,

\[ \int\frac{dy}{y} + \frac{4}{(n+1)c^2}\int x\,dx = 0. \]

Therefore,

\[ \log y + \frac{2x^2}{(n+1)c^2} + A = 0, \]

where A is a constant. Hence

\[ y = y_0\,e^{-2x^2/(n+1)c^2}, \]

where $y_0$ is a new constant.

This may be written

\[ y = y_0\,e^{-x^2/2\sigma^2}, \qquad (1) \]

where $\sigma^2 = (n+1)c^2/4$, and it is called the probability curve or normal
curve of error.*
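
The closeness of curve (1) to the symmetrical binomial polygon can be checked numerically. The following is only an illustrative sketch in Python: the function names, the choice a = 1 and the values n = 100, c = 1 are assumptions made for the example and are not part of the text. It scales the largest ordinate to unity and compares each binomial ordinate with the ordinate of the fitted curve at the same distance from the mode.

```python
from math import comb, exp


def binomial_ordinates(n):
    """Terms of (a + a)^n with a = 1: y_r = C(n, r - 1) for r = 1 .. n + 1."""
    return [comb(n, r - 1) for r in range(1, n + 2)]


def normal_curve(x, y0, n, c):
    """Curve (1): y = y0 * exp(-x^2 / (2 sigma^2)) with sigma^2 = (n + 1) c^2 / 4.

    x is measured from the mode, as in the text after the change of origin.
    """
    sigma_sq = (n + 1) * c * c / 4
    return y0 * exp(-x * x / (2 * sigma_sq))


def compare(n=100, c=1.0):
    """Return rows (x, binomial ordinate / maximum ordinate, curve ordinate)."""
    ys = binomial_ordinates(n)
    y_max = max(ys)
    centre = (n + 2) * c / 2  # abscissa of the maximum ordinate when n is even
    rows = []
    for r in range(1, n + 2):
        x = r * c - centre
        rows.append((x, ys[r - 1] / y_max, normal_curve(x, 1.0, n, c)))
    return rows


if __name__ == "__main__":
    for x, observed, fitted in compare():
        if abs(x) <= 10 and x % 5 == 0:
            print(f"x = {x:6.1f}   binomial = {observed:.5f}   curve = {fitted:.5f}")
```

With n as large as 100 the two columns should agree to about two decimal places, which is the sense in which the curve "removes the jerks" of the polygon without departing from it.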

Let us now see whether the procedure so far followed is applicable 
in the case of an unsymmetrical or skew distribution of statistics.
With this object we will suppose the frequencies of observations in
successive groups to be represented by the corresponding terms in
the expansion

\[ (p+q)^n = p^n + np^{n-1}q + \frac{n(n-1)}{1\cdot 2}p^{n-2}q^2 + \dots + q^n, \]

and as before we can form a frequency polygon by joining the
summits of the ordinates

\[ y_1 = p^n,\quad y_2 = np^{n-1}q,\quad \dots,\quad y_{n+1} = q^n, \]


[* Karl Pearson's method of getting the normal curve equation has been
adopted as the basis of the above discussion, in preference to that usually 
followed, which develops the curve also from the binomial expression but some¬ 
what on the lines of Laplace and Poisson. They showed that the sum of all the 
terms lying within a range t on either side of the maximum term in the expansion
of $(p+q)^n$ is approximately

where whence the equation of the curve is derived. (See Historical 

Note at the end of Chapter XVIII.)]

erected on the axis of x at distances from the origin given by

\[ x_1 = c,\quad x_2 = 2c,\quad \dots,\quad x_{n+1} = (n+1)c, \]

the figure being very similar to that in the symmetrical case.

The gradient of the fitting curve where it touches the join of
$(x_r, y_r)$ to $(x_{r+1}, y_{r+1})$ is given by

\[ \frac{dy}{dx} = \frac{y_{r+1}-y_r}{c}, \]

and we must try and express the right-hand side as before in
terms of (x, y), the co-ordinates of the mid-point of the line joining
$(x_r, y_r)$ to $(x_{r+1}, y_{r+1})$.


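We have

\[
\begin{aligned}
\frac{dy}{dx}
&= \frac{1}{c}\left[\frac{n(n-1)\cdots(n-r+1)}{1\cdot 2\cdots r}\,p^{n-r}q^{r} - \frac{n(n-1)\cdots(n-r+2)}{1\cdot 2\cdots(r-1)}\,p^{n-r+1}q^{r-1}\right] \\
&= \frac{p^{n-r}q^{r-1}}{c}\cdot\frac{n(n-1)\cdots(n-r+2)}{1\cdot 2\cdots(r-1)}\left[\frac{n-r+1}{r}\,q - p\right].
\end{aligned}
\]

Also

\[ 2x = x_r + x_{r+1} = rc + (r+1)c = (2r+1)c, \]

\[ 2y = y_r + y_{r+1} = p^{n-r}q^{r-1}\cdot\frac{n(n-1)\cdots(n-r+2)}{1\cdot 2\cdots(r-1)}\left[\frac{n-r+1}{r}\,q + p\right]. \]

Thus

\[
\begin{aligned}
\frac{dy}{dx}
&= \frac{2y}{c}\,\bigl[(n+1)q - r(p+q)\bigr]\big/\bigl[(n+1)q + r(p-q)\bigr] \\
&= \frac{2y}{c}\,\bigl[2(n+1)qc - (p+q)(2x-c)\bigr]\big/\bigl[2(n+1)qc + (p-q)(2x-c)\bigr].
\end{aligned}
\]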
c 

This, being true for all such pairs of values of x and y, is now in a 
form independent of any particular point on the curve we seek ; 
in other words, it may be taken as the differential equation of the
curve, and it is evidently of the type

\[ \frac{dy}{dx} = \frac{y(\alpha-x)}{\beta+\gamma x}, \]

where $\alpha$, $\beta$, $\gamma$ involve only p, q, n, etc., the constants of the
distribution we set out to fit.

The equation is simplified if we transfer the origin to the point
$(\alpha, 0)$, when it becomes

\[ \frac{dy}{dx} = -\frac{yx}{\gamma x+\delta}, \]

where $\delta = \beta+\gamma\alpha$.

To integrate, separate the variables as before:

\[ \int\frac{dy}{y} + \int\frac{x}{\gamma x+\delta}\,dx = 0. \]

Therefore,

\[ \log y + \frac{x}{\gamma} - \frac{\delta}{\gamma^2}\log(\gamma x+\delta) + A = 0, \]

where A is a constant,
or

\[ y = B\,e^{-x/\gamma}(\gamma x+\delta)^{\delta/\gamma^2}, \]

where B is a constant.

It may be written

\[ y = y_0\,e^{-kx}\left(1+\frac{x}{a}\right)^{ka}, \qquad (2) \]

where $k = 1/\gamma$, $a = \delta/\gamma$, and $y_0$ is a new constant.
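
As a check on the integration, one may confirm numerically that curve (2) does satisfy the differential equation from which it came. The sketch below is illustrative only: the function names and the values $\gamma = 0.2$, $\delta = 8$ are assumptions chosen for the example, and the slope is estimated by a simple central difference rather than by any method given in the text.

```python
from math import exp


def skew_curve(x, y0, k, a):
    """Curve (2): y = y0 * exp(-k x) * (1 + x/a)^(k a), x measured from the mode."""
    return y0 * exp(-k * x) * (1 + x / a) ** (k * a)


def slope_from_equation(x, y, gamma, delta):
    """Right-hand side of dy/dx = -y x / (gamma x + delta)."""
    return -y * x / (gamma * x + delta)


def worst_discrepancy(gamma=0.2, delta=8.0, y0=1.0, step=1e-5):
    """Largest gap between a central-difference slope and the slope the equation requires."""
    k, a = 1 / gamma, delta / gamma  # k = 1/gamma, a = delta/gamma as in the text
    worst = 0.0
    for i in range(-30, 61):
        x = i * 0.5  # x runs from -15 to 30; 1 + x/a stays positive since a = 40
        numeric = (skew_curve(x + step, y0, k, a)
                   - skew_curve(x - step, y0, k, a)) / (2 * step)
        required = slope_from_equation(x, skew_curve(x, y0, k, a), gamma, delta)
        worst = max(worst, abs(numeric - required))
    return worst


if __name__ == "__main__":
    print("largest discrepancy in slope:", worst_discrepancy())
```

The discrepancy should be of the order of the rounding error in the difference quotient. Putting x = 0 in either function also shows at once that the slope vanishes at the mode.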

This, then, may prove a suitable type of curve to fit a set of 
statistics forming a skew frequency distribution, but the question 
now arises whether equations (1) and (2) are the most general
types possible. Clearly (1) is only a particular case of (2) obtained 
by making p=q, and, this being so, (2) may itself be a particular 
case of some still more general type. 

Light may be thrown on this if we consider the geometrical 
bearing of the differential equation obtained in the last case : 

\[ \frac{dy}{dx} = \frac{y(\alpha-x)}{\beta+\gamma x}. \qquad (3) \]

The presence of y and $(\alpha-x)$ in the numerator of the right-hand
side of (3) shows that dy/dx vanishes when y = 0 and when $x=\alpha$, i.e. the
curve touches the axis of x where the two meet and there is a
maximum point on the curve at $x=\alpha$. (Since $\alpha$ is the particular
value of the organ or character x for which the frequency is a
maximum, $\alpha$ is of course the mode.) Now these two characteristics
are the very ones to which we wished to give symbolical expression
since they serve to describe in broad outline what was agreed to
be the trend of the majority of frequency distributions: the rise
from zero to a maximum, at first gradually, then faster, and, after 
passing through the maximum, the fall to zero again, generally at 
a different rate. 

As to the denominator of equation (3), the corresponding equation 
for type (1), before the origin was changed, was similar to equation (3),
except that it contained no x term in the denominator, and that is
readily understood when we note that $\gamma$ is a multiple of (p - q)
and thus vanishes when p = q. Now, if from (3) we get a less
general type of curve by dropping the x term in the denominator,
we may perhaps get a more general type by adding an $x^2$ term, and
even an $x^3$ term, an $x^4$ term, and so on. In fact there seems no
reason why the denominator should not be any function of x, say
f(x), which, however, we shall suppose for simplicity capable of
expansion in a Maclaurin's series of ascending powers of x which
converges quickly. 

We are led to propose, therefore, as more general than (3), the
differential equation

\[ \frac{dy}{dx} = \frac{y(x+b)}{px^2+qx+r}. \qquad (4) \]

We stop at $x^2$ in the denominator because it has been found, if we
may anticipate results to save needless labour, that beyond this 
point the heaviness of the calculation involved and the decreasing 
accuracy of the higher moments that have to be introduced out¬ 
weigh any other advantage gained. The curve or set of curves 
resulting from the integration of equation (4) is known as Karl 
Pearson’s Generalized Probability Curve, and their author has 
stated that, while it comprises the two other types as special cases, 
it practically covers all homogeneous statistics he has had to deal 
with. 

Just as the differential equations in the first two cases considered 
were related respectively to the symmetrical and the skew binomial 
expansions, so is equation (4) related to the hypergeometrical 
expansion 

\[ \frac{{}^{pn}C_r}{{}^{n}C_r} + \frac{{}^{pn}C_{r-1}\,{}^{qn}C_1}{{}^{n}C_r} + \dots + \frac{{}^{qn}C_r}{{}^{n}C_r}, \]

the successive terms of which express the probability that r black
balls, (r-1) black balls and 1 white ball, (r-2) black balls and
2 white balls, ..., r white balls, will be drawn from a bag containing
pn black balls and qn white ones, where (p+q) = 1, when r balls
are drawn in all, none being replaced before the next is drawn.
If the terms of this expansion are represented by ordinates of
which the summits determine a polygon as in the binomial cases,
the corresponding expression for the gradient of the curve at any
point is given by an equation of type (4). We need not go over
the detailed proof of this statement since it follows precisely the 
same lines as in the previous cases. 

The method of integration of the equation

\[ \frac{dy}{dx} = \frac{y(x+b)}{px^2+qx+r} \]

depends upon the nature of the roots of the quadratic in the
denominator which may be written

\[ px^2+qx+r = p\left[\left(x+\frac{q}{2p}\right)^2 - \frac{q^2}{4p^2}\left(1-\frac{4pr}{q^2}\right)\right], \]

where $\kappa = q^2/4pr$, and it is evident that the quadratic splits up into
real factors if $\kappa(\kappa-1)$ is positive. This is the case when $\kappa$ has any
negative value, or when it is positive and greater than 1, the truth of which
may be seen more effectively if the curve

\[ y = \kappa(\kappa-1), \]

a parabola symmetrical about the line $\kappa = \tfrac12$, be drawn, fig. (29), by plotting
y against $\kappa$.

[Fig. (29): the parabola $y = \kappa(\kappa-1)$ plotted against $\kappa$.]

Further, the product of the roots of the quadratic

\[ px^2+qx+r = 0 \]

is

\[ \frac{r}{p} = \frac{4r^2}{4pr} = \frac{4r^2}{q^2}\cdot\kappa, \]

so that the roots when real will be of the same sign if $\kappa$ is positive
and of opposite signs if $\kappa$ is negative. The boundary lines

\[ \kappa = 0 \quad\text{and}\quad \kappa = 1 \]


thus divide the whole field into three parts, as shown in fig. (30), in
one of which the roots are real and of opposite sign, in the next
the roots are imaginary, and in the third the roots are real and of
the same sign. At the boundaries we get particular cases as
follows:

$\kappa = 0$: this requires q = 0, since $\kappa = q^2/4pr$, which makes the
roots of the quadratic equal but of opposite sign, unless p = 0 also,
and in that case both roots are infinite;

$\kappa = 1$: the roots are real and equal and of the same sign;

$\kappa = \infty$: this requires p = 0 or r = 0; in the former case one root
of the quadratic is infinite, and in the latter one root is zero.

Thus, returning to the differential equation, the curves which result
from the integration

\[ \int\frac{dy}{y} = \int\frac{(x+b)\,dx}{px^2+qx+r} \]

are of different types according to the value of $\kappa$, which is therefore
called the criterion.
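
The division of the field by the criterion can be set out as a small routine. The following Python sketch is an illustration only: the function names, the tolerance used to recognise $\kappa = 1$, and the decision to reject r = 0 (a case the text mentions but does not assign to a type) are all assumptions of the example.

```python
import math


def criterion(p, q, r):
    """kappa = q^2 / (4 p r), taken as infinite when p r = 0 and q is not zero."""
    if p * r == 0:
        return 0.0 if q == 0 else math.inf
    return q * q / (4 * p * r)


def pearson_type(p, q, r, tol=1e-12):
    """Name the type of curve for the denominator p x^2 + q x + r.

    Follows the boundaries kappa = 0, kappa = 1 and kappa = infinity
    described in the text; r = 0 is left unclassified here.
    """
    if r == 0:
        raise ValueError("r = 0 (one root zero) is not assigned to a type here")
    if p == 0 and q == 0:
        return "VII"   # both roots infinite: the normal curve
    if p == 0:
        return "III"   # kappa infinite, one root infinite
    if q == 0:
        return "II"    # kappa = 0, roots equal and of opposite sign
    kappa = criterion(p, q, r)
    if kappa < 0:
        return "I"     # roots real and of opposite sign
    if abs(kappa - 1) <= tol:
        return "V"     # roots real and equal
    if kappa < 1:
        return "IV"    # roots imaginary
    return "VI"        # roots real and of the same sign


if __name__ == "__main__":
    for coefficients in [(1, 1, -1), (1, 0, -1), (0, 1, 1), (1, 1, 1),
                         (1, 2, 1), (1, 3, 1), (0, 0, -1)]:
        print(coefficients, "-> Type", pearson_type(*coefficients))
```

In practice the constants p, q, r come from the moments of the observations, so an exact boundary value such as $\kappa = 1$ will seldom occur; the tolerance merely makes the boundary cases reachable in the illustration.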

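[Fig. (30): the regions of $\kappa$ marked off by the boundaries $\kappa = 0$ and $\kappa = 1$.]

Type I. $\kappa$ negative. Roots of $px^2+qx+r=0$ real and of opposite sign.

In this case we may write

\[ px^2+qx+r = p(x+\alpha')(x-\beta') \]

and so get

\[ \int\frac{dy}{y} + \int\frac{(x+b)\,dx}{p(\alpha'+x)(\beta'-x)} = 0, \]

or, transferring the origin to the point (-b, 0), the mode, we have

\[ \int\frac{dy}{y} + \int\frac{x\,dx}{p(\alpha'-b+x)(\beta'+b-x)} = 0, \]

or

\[ \int\frac{dy}{y} + \int\frac{x\,dx}{p(\alpha+x)(\beta-x)} = 0, \]

where $\alpha = \alpha'-b$, $\beta = \beta'+b$.

Therefore,

\[ \log y - \frac{\alpha}{p(\alpha+\beta)}\int\frac{dx}{\alpha+x} + \frac{\beta}{p(\alpha+\beta)}\int\frac{dx}{\beta-x} + A = 0, \]

where A is a constant.

Thus

\[ \log y = \frac{1}{p(\alpha+\beta)}\bigl[\alpha\log(\alpha+x) + \beta\log(\beta-x)\bigr] + \log B, \]

where B is a constant,
whence

\[ y = B(\alpha+x)^{\alpha/p(\alpha+\beta)}(\beta-x)^{\beta/p(\alpha+\beta)}, \]

i.e.

\[ y = y_0\left(1+\frac{x}{\alpha}\right)^{\nu\alpha}\left(1-\frac{x}{\beta}\right)^{\nu\beta}, \qquad (5) \]

where $\nu = 1/p(\alpha+\beta)$ and $y_0$ is a new constant.

This is a skew curve of limited range, bounded by the lines $x = -\alpha$
and $x = +\beta$, with the mode at the origin.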


Type II. $\kappa = 0$, q = 0, but not p = 0. Roots of $px^2+qx+r=0$
equal and of opposite sign.

This curve is just a particular case of type I., which reduces
(writing a for the common value of $\alpha$ and $\beta$) to

\[ y = y_0\left(1-\frac{x^2}{a^2}\right)^{\nu a}, \qquad (6) \]

symmetrical about the axis of y (because for any value of y there
are two values of x, equal and of opposite sign) and of limited
range bounded by x = -a and x = +a, with the mode at the origin.


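Type III. $\kappa = \infty$,* p = 0, but not r = 0. One root of $px^2+qx+r=0$
infinite.

This is the skew binomial case over again. It may be also deduced
from type I. by making one root, say $\beta'$, tend to infinity.
The curve then takes the form

\[ y = y_0\left(1+\frac{x}{\alpha}\right)^{\nu\alpha}\ \lim_{\beta\to\infty}\left(1-\frac{x}{\beta}\right)^{\nu\beta}, \]

because $\beta = \beta'+b$, so that $\beta$ tends to infinity with $\beta'$. Hence

\[ y = y_0\left(1+\frac{x}{\alpha}\right)^{\nu\alpha}\ \lim_{\lambda\to\infty}\left[\left(1+\frac{1}{\lambda}\right)^{\lambda}\right]^{-\nu x}, \]

where $\lambda = -\beta/x$.
Thus

\[ y = y_0\left(1+\frac{x}{\alpha}\right)^{\nu\alpha}e^{-\nu x}, \qquad (7) \]

a skew curve limited in one direction by the line $x = -\alpha$, with the
mode at the origin.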


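[* Although theoretically this type corresponds to an infinite value for $\kappa$, in
practice it will as a rule give a reasonable fit provided $\kappa$ is numerically greater
than 4. (See W. P. Elderton's Frequency Curves and Correlation, p. 60.)]

Type IV. $\kappa$ positive and < 1. Roots of $px^2+qx+r=0$ imaginary.

Put $\kappa(\kappa-1) = -\lambda^2$, and the differential equation then leads to

\[ \int\frac{dy}{y} = \int\frac{(x+b)\,dx}{p\left[\left(x+\dfrac{q}{2p}\right)^2 + \left(\dfrac{2r\lambda}{q}\right)^2\right]}. \]

Transfer the origin to the point $\left(-\dfrac{q}{2p},\ 0\right)$; then

\[ \log y = A + \frac{1}{2p}\log\left[x^2+\left(\frac{2r\lambda}{q}\right)^2\right] + \frac{1}{p}\left(b-\frac{q}{2p}\right)\frac{q}{2r\lambda}\tan^{-1}\frac{qx}{2r\lambda}, \]

where A is a constant.
Therefore,

\[ y = y_0\left(1+\frac{x^2}{a^2}\right)^{-m}e^{-\nu\tan^{-1}(x/a)}, \qquad (8) \]

where

\[ a = \frac{2r\lambda}{q},\qquad m = -\frac{1}{2p},\qquad \nu = -\frac{1}{pa}\left(b-\frac{q}{2p}\right), \]

and $y_0$ is a constant.

This is a skew curve of unlimited range in both directions. The
position of the mode is found by putting dy/dx = 0 after differentiation,
or, what comes to the same thing, is seen by direct reference
to the differential equation itself. Thus the distance of the
mode from the origin

\[ = -\left(b-\frac{q}{2p}\right) = -\nu a/2m. \]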


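Type V. $\kappa = 1$. Roots of $px^2+qx+r=0$ real and equal.

The equation to integrate becomes

\[ \int\frac{dy}{y} = \int\frac{(x+b)\,dx}{p\left(x+\dfrac{q}{2p}\right)^2}. \]

Transfer the origin to the point $\left(-\dfrac{q}{2p},\ 0\right)$, and this becomes

\[ \int\frac{dy}{y} = \int\frac{\left(x+b-\dfrac{q}{2p}\right)dx}{px^2}, \]

\[ \log y = A + \frac{1}{p}\log x - \frac{1}{p}\left(b-\frac{q}{2p}\right)\frac{1}{x}, \]

where A is a constant.

Therefore,

\[ y = y_0\,x^{-s}e^{-\gamma/x}, \qquad (9) \]

where $s = -1/p$, $\gamma = \dfrac{1}{p}\left(b-\dfrac{q}{2p}\right)$ and $y_0$ is a constant.

Here x cannot become negative, so that the curve is skew and
limited in one direction. The distance of the mode from the origin

\[ = -\left(b-\frac{q}{2p}\right) = -p\gamma = \gamma/s. \]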


Type VI. $\kappa$ positive and > 1. Roots of $px^2+qx+r=0$ real and of the
same sign.

Equation becomes

\[ \int\frac{dy}{y} = \int\frac{(x+b)\,dx}{p(x+\alpha)(x+\beta)}, \]

whence

\[ \log y = A + \frac{1}{p(\beta-\alpha)}\bigl[(b-\alpha)\log(x+\alpha) - (b-\beta)\log(x+\beta)\bigr], \]

where A is a constant;
or, transferring the origin to $(-\beta, 0)$,

\[ \log y = A + \frac{1}{p(\beta-\alpha)}\bigl[(b-\alpha)\log\{x-(\beta-\alpha)\} - (b-\beta)\log x\bigr], \]

so that

\[ y = y_0\,(x-a)^{q_2}\,x^{-q_1}, \qquad (10) \]

where $a = \beta-\alpha$, $q_2 = (b-\alpha)/p(\beta-\alpha)$, $q_1 = (b-\beta)/p(\beta-\alpha)$, and $y_0$ is a
constant.

This is a skew curve bounded by x = a in one direction. The
distance of the mode from the origin $= -(b-\beta) = aq_1/(q_1-q_2)$.

Type VII. $\kappa = 0$, q = 0, p = 0. Roots of the quadratic $px^2+qx+r=0$
both infinite.

This is the symmetrical binomial case over again and the integration
reduces to

\[ \int\frac{dy}{y} = \int\frac{(x+b)\,dx}{r}, \]

or, transferring the origin to (-b, 0),

\[ \int\frac{dy}{y} = \int\frac{x\,dx}{r}, \]

\[ \log y = A + \frac{x^2}{2r}, \]

where A is a constant.

Therefore

\[ y = y_0\,e^{-x^2/2\sigma^2}, \qquad (11) \]

where $y_0$ is a constant and $\sigma^2 = -r$.

This curve, the normal curve of error, is symmetrical about the 
axis of y, where mean and mode coincide, and it is of unlimited 
range on either side of it. 





CHAPTER XVI 


CURVE FITTING (continued) - THE METHOD OF MOMENTS
FOR CONNECTING CURVE AND STATISTICS

We have now completed the first stage of the discussion upon which 
we embarked : we have found by the application of general prin¬ 
ciples various types of curve, represented by different equations, 
which are said to fit more or less satisfactorily a considerable number 
at all events of frequency distributions composed of homogeneous 
material. 

Our next task is to pass from the general to the particular, to 
see how to set up a connection between an actually observed fre¬ 
quency distribution and the appropriate theoretical curve. This 
again seems to break up into two parts: (1) to find a way of deciding
which type of curve to adopt in a particular case ; (2) to determine 
the constants of the curve in terms of the observed statistics; but 
since the criterion, $\kappa$, which distinguishes one type of curve from
another is itself a function of the constants of the curve before 
integration, it follows that the solution of the first part is incidental 
to that of the second. 

The general method proposed for determination of the constants 
of the curve in terms of the observed statistics is the now well-known 
method of moments due to Karl Pearson, whereby the area and 
moments of the fitting curve are equated to the area and moments, 
calculated from the statistics, of the observation curve. 

If a frequency table be drawn up (see Table (40)) showing the
number f of observations corresponding to the deviation x of each
value, or group mid-value, X of the character observed from some
fixed value, the expression

\[ x_1f_1 + x_2f_2 + \dots + x_rf_r + \dots \]

is called the first moment of the distribution with reference to the
fixed value, which may be termed the origin. Similarly,

\[ x_1^2f_1 + x_2^2f_2 + \dots + x_r^2f_r + \dots \]

is called the second moment, $\Sigma x^3f$ the third moment, $\Sigma x^4f$ the
fourth moment, and so on. The following notation will be found
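convenient for working purposes:

\[ \nu'_1 = \frac{N'_1}{N} = \frac{\Sigma xf}{\Sigma f},\qquad \nu'_2 = \frac{N'_2}{N} = \frac{\Sigma x^2f}{\Sigma f},\qquad \dots \]

Undashed letters are reserved for use when the distribution is referred
to its mean as origin, in other words when the deviations of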
the X’s are measured from the mean X. 
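
The arithmetic of the moment columns is easily mechanised. The sketch below, in Python, is illustrative only: the function names and the small frequency table used in the example are hypothetical and are not taken from the text. It forms the totals N, $N'_1$, $N'_2$, ... about the chosen origin and then the corresponding moments per observation about the mean.

```python
def rough_moments(deviations, frequencies, highest=4):
    """Column totals of the moment table: [N, N'_1, ..., N'_highest].

    The k-th entry is the sum of x^k f over the table, x being the
    deviation from the chosen origin and f the frequency.
    """
    return [sum((x ** k) * f for x, f in zip(deviations, frequencies))
            for k in range(highest + 1)]


def moments_about_mean(deviations, frequencies, highest=4):
    """Moments per observation when deviations are measured from the mean."""
    totals = rough_moments(deviations, frequencies, highest=1)
    n = totals[0]
    mean = totals[1] / n
    return [sum(((x - mean) ** k) * f for x, f in zip(deviations, frequencies)) / n
            for k in range(highest + 1)]


if __name__ == "__main__":
    # Hypothetical table: deviations 0..4 from the origin, symmetrical frequencies.
    deviations = [0, 1, 2, 3, 4]
    frequencies = [1, 4, 6, 4, 1]
    print("totals about the origin:", rough_moments(deviations, frequencies))
    print("moments about the mean: ", moments_about_mean(deviations, frequencies))
```

For the symmetrical frequencies chosen, the odd moments about the mean vanish, as they must for any symmetrical distribution.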


Table (40).

Deviation.  Frequency.  First      Second      Third       Fourth
                        Moment.    Moment.     Moment.     Moment.

   x_1         f_1      x_1 f_1    x_1^2 f_1   x_1^3 f_1   x_1^4 f_1
   x_2         f_2      x_2 f_2    x_2^2 f_2   x_2^3 f_2   x_2^4 f_2
   ...         ...      ...        ...         ...         ...
   x_r         f_r      x_r f_r    x_r^2 f_r   x_r^3 f_r   x_r^4 f_r
   ...         ...      ...        ...         ...         ...

 Totals .       N        N'_1       N'_2        N'_3        N'_4


Now each N in the frequency table is the sum of a number of 
discrete quantities which only tend to form a continuous series as 
the class intervals are made very small and the number of observa¬ 
tions is made very large. The corresponding frequency polygon 
or histogram, if we drew it, would at the same time tend to become 
a continuous curve, the observation curve. If that limiting stage 
were attainable, if we could actually get an infinitely large sample 
of observations in which the character observed changed by infinitesimal
amounts, we could then replace the isolated f's of observation
by the corresponding values of y', the ordinates of this observation curve,
and to get the moments we could write instead of the discrete
sums

\[ \Sigma f,\quad \Sigma xf,\quad \Sigma x^2f,\quad \dots \]

the continuous integral expressions

\[ \int y'\,dx,\quad \int xy'\,dx,\quad \int x^2y'\,dx,\quad \dots \]

taking in the whole sweep of the curve by integrating throughout
the range of deviation x. We should then have, if areas and
moments are equated according to Pearson's method,

\[ \int y\,dx = \int y'\,dx,\quad \int xy\,dx = \int xy'\,dx,\quad \int x^2y\,dx = \int x^2y'\,dx,\quad \dots,\quad \int x^ny\,dx = \int x^ny'\,dx, \]

where y is the ordinate of the fitting curve corresponding to the 
ordinate y' of the observation curve. 

In practice, however, it is impossible to go to this limit: we 
cannot deal with an infinitely large sample, so we take as large a 
sample as is convenient, calculate the rough moments, N, $N'_1$, $N'_2$, ...,
and find approximately what corrections or adjustments are neces¬ 
sary to obtain the moments of the observation curve, a procedure 
which is really equivalent to the determination of the area of a 
curve when only a number of isolated points thereon are known. 

For the full analytical justification of the method of moments 
the reader is referred to Professor Pearson’s original paper, On 
the Systematic Fitting of Curves to Observations and Measurements 
[Biometrika, vol. i., pp. 265 et seq. ; also vol. ii., pp. 1-23], where 
it is shown that ‘ with due precautions as to quadrature, it 
gives, when one can make a comparison, sensibly as good results 
as the method of least squares.’ The latter, which is the traditional 
way of approaching all such problems, is shown to be impracticable 
in a large number of cases, either because the resulting equations 
cannot be solved, or, when they are capable of solution, because 
the labour involved would be colossal. 

Let us consider next how to deduce the area and moments of the 
observation curve from the statistics, in other words how to get 

\[ \int y'\,dx,\quad \int xy'\,dx,\quad \int x^2y'\,dx,\quad \dots, \]

the integrals being taken throughout the range of the curve, when 
we know the frequencies corresponding to only a certain number 
of values or elementary ranges of the deviation x. 

Now the character observed may be capable of the deviations 
actually recorded and of no values in between, e.g. measuring
deviations from 'no rooms' as origin, we might have $f_1$ one-roomed
tenements, $f_2$ two-roomed tenements, $f_3$ three-roomed tenements, but
there could be no such thing as a two-and-a-half or a three-and-a-quarter-roomed
tenement; on the other hand, any recorded deviation,
$x_r$, may be only the mid-value (used as a convenient and
concise approximation) of a group of observations including all in
the continuous range from $(x_r-\tfrac12)$ to $(x_r+\tfrac12)$, where unit deviation
is the class interval; thus we might have $f_1$ males deviating by
+6 in. from 5 ft. (comprising all the males observed between 5 ft.
5½ in. and 5 ft. 6½ in.), $f_2$ males deviating by +5 in. from 5 ft. (comprising
all males between 5 ft. 4½ in. and 5 ft. 5½ in.), and so on.
These two cases must be discussed separately. 

(1) When the observations are centred at definite but isolated values
of x.

The problem is to find

\[ \int x^ny'\,dx \]

(the nth moment) when we have no definite curve given but we
know the values of x and y' at a number of isolated points, say

\[ (x_0, y'_0),\ (x_1, y'_1),\ (x_2, y'_2),\ \dots,\ (x_p, y'_p). \]

This is equivalent to discovering a suitable 'quadrature formula,'
i.e. a good approximation to

\[ \int z\,dx \]

in terms of known points

\[ (x_0, z_0),\ (x_1, z_1),\ (x_2, z_2),\ \dots,\ (x_p, z_p), \]

where we have written z in place of $x^ny'$, and we may generally
take the ordinates to be at equal distances, h, apart. Several
such formulae have been suggested and they vary according as the
z's are situated at the ends (fig. (31)) or at the centres (fig. (32))
of the h intervals. The second type is perhaps the more useful of
the two, and we shall work out one formula in illustration of it.

[Figs. (31) and (32): ordinates z placed at the ends and at the centres of the h intervals respectively.]
Consider the first five of the given points, namely,

\[ (x_0, z_0),\ (x_1, z_1),\ \dots,\ (x_4, z_4). \]

As a simple 'curve of closest contact' let us find the parabola of
type

\[ z = c_0 + c_1x/h + c_2x^2/h^2 + c_3x^3/h^3 + c_4x^4/h^4, \qquad (1) \]

which goes through these five points, where the c's are constants to
be determined. We may without loss of generality take the axis
of z to coincide with the middle one of the five ordinates, so that
the known points on the curve become

\[ (-2h, z_0),\ (-h, z_1),\ (0, z_2),\ (+h, z_3),\ (+2h, z_4), \]

and on substitution in (1) we get

\[
\begin{aligned}
z_0 &= c_0 - 2c_1 + 4c_2 - 8c_3 + 16c_4, \\
z_1 &= c_0 - c_1 + c_2 - c_3 + c_4, \\
z_2 &= c_0, \\
z_3 &= c_0 + c_1 + c_2 + c_3 + c_4, \\
z_4 &= c_0 + 2c_1 + 4c_2 + 8c_3 + 16c_4.
\end{aligned}
\]



Fig. (33).

These equations are just sufficient uniquely to determine the c's,
and hence the parabolic curve of closest contact, in terms of the
five given points, but for our purpose it is not necessary to find
all the c's. Suppose our object is to find the area of the shaded
portion of fig. (33) in terms of the co-ordinates of the five given
points. This area

$$= \int_{-h/2}^{+h/2} z\,dx$$
$$= \int_{-h/2}^{+h/2} \left(c_0 + c_1 x/h + c_2 x^2/h^2 + c_3 x^3/h^3 + c_4 x^4/h^4\right)dx$$
$$= \Big[c_0 x + c_1 x^2/2h + c_2 x^3/3h^2 + c_3 x^4/4h^3 + c_4 x^5/5h^4\Big]_{-h/2}^{+h/2}$$
$$= c_0 h + c_2 h/12 + c_4 h/80.$$


But the equations between the z's and c's at once give

$$z_2 = c_0,\quad z_0 + z_4 = 2(c_0 + 4c_2 + 16c_4),\quad z_1 + z_3 = 2(c_0 + c_2 + c_4).$$

Thus

$$8c_2 + 32c_4 = (z_0 + z_4) - 2z_2,$$
$$2c_2 + 2c_4 = (z_1 + z_3) - 2z_2.$$

Therefore

$$24c_2 = 16(z_1 + z_3) - (z_0 + z_4) - 30z_2,$$
$$24c_4 = (z_0 + z_4) - 4(z_1 + z_3) + 6z_2.$$

Hence, by substitution, the shaded area becomes

$$\int_{-h/2}^{+h/2} z\,dx = h\Big[z_2 + \tfrac{1}{288}\{16(z_1+z_3) - (z_0+z_4) - 30z_2\} + \tfrac{1}{1920}\{(z_0+z_4) - 4(z_1+z_3) + 6z_2\}\Big]$$
$$= \frac{h}{5760}\big[5178z_2 - 17(z_0+z_4) + 308(z_1+z_3)\big], \qquad (2)$$

these particular ordinates being appropriate when the axis of z
coincides with the ordinate $z_2$.
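
Formula (2) can be checked numerically. The following is a minimal sketch, not part of the original text: the function names `mid_ordinate_area` and `check_monomial` are illustrative only, and exact fractions are used so that the comparison is free of rounding. Because the curve of closest contact is of the fourth degree, the formula should reproduce the exact area for every power of x up to the fourth.

```python
from fractions import Fraction


def mid_ordinate_area(z0, z1, z2, z3, z4, h=1):
    """Formula (2): area over the middle interval (-h/2, +h/2) under the
    quartic through five equally spaced ordinates z0..z4."""
    return Fraction(h) * (5178 * z2 - 17 * (z0 + z4) + 308 * (z1 + z3)) / 5760


def check_monomial(k):
    """Return (formula value, exact value) for z = x**k with h = 1."""
    ordinates = [Fraction(x) ** k for x in (-2, -1, 0, 1, 2)]
    approx = mid_ordinate_area(*ordinates)
    half = Fraction(1, 2)
    exact = (half ** (k + 1) - (-half) ** (k + 1)) / (k + 1)
    return approx, exact


for k in range(5):
    approx, exact = check_monomial(k)
    print(k, approx, exact)
```

For the second and fourth powers the two columns agree at 1/12 and 1/80, which are the coefficients of $c_2 h$ and $c_4 h$ found above.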

Similarly, it can be shown that

$$\int_{-h/2}^{+3h/2} z\,dx = \frac{h}{24}\big[27z_0 + 17z_1 + 5z_2 - z_3\big], \qquad (3)$$

by finding the parabolic curve of closest contact through $(0, z_0)$,
$(h, z_1)$, $(2h, z_2)$, $(3h, z_3)$, the axis of z coinciding now
with $z_0$.

Now we require

$$\int_{-h/2}^{(p+\frac12)h} z\,dx$$

(see fig. (32)), and this may be obtained by splitting up the integral
thus

$$\int_{-h/2}^{3h/2} + \int_{3h/2}^{5h/2} + \int_{5h/2}^{7h/2} + \ldots + \int_{(p-\frac52)h}^{(p-\frac32)h} + \int_{(p-\frac32)h}^{(p+\frac12)h}$$

and applying the formulae (2) and (3) to evaluate these sub-integrals.
The first and last come under head (3), while all the rest come
under (2). In fact, we fit together portions of curves of parabolic
type based on the successive groups of points

(0, 1, 2, 3), (0, 1, 2, 3, 4), (1, 2, 3, 4, 5), (2, 3, 4, 5, 6), . . .
(p-4, p-3, p-2, p-1, p), (p-3, p-2, p-1, p),

and as the points overlap, in the sense that neighbouring groups
have points in common, the curves dovetail into one another and
so provide a fairly good approximation to what we want in the way
of integral expressions giving areas based upon the positions of
certain known points.

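We have, then:

$$\int_{-h/2}^{3h/2} z\,dx = \frac{h}{24}\big[27z_0 + 17z_1 + 5z_2 - z_3\big]$$
$$\int_{3h/2}^{5h/2} z\,dx = \frac{h}{5760}\big[5178z_2 - 17(z_0+z_4) + 308(z_1+z_3)\big]$$
$$\int_{5h/2}^{7h/2} z\,dx = \frac{h}{5760}\big[5178z_3 - 17(z_1+z_5) + 308(z_2+z_4)\big]$$
$$\int_{7h/2}^{9h/2} z\,dx = \frac{h}{5760}\big[5178z_4 - 17(z_2+z_6) + 308(z_3+z_5)\big]$$
$$\ldots$$
$$\int_{(p-\frac52)h}^{(p-\frac32)h} z\,dx = \frac{h}{5760}\big[5178z_{p-2} - 17(z_{p-4}+z_p) + 308(z_{p-3}+z_{p-1})\big]$$
$$\int_{(p-\frac32)h}^{(p+\frac12)h} z\,dx = \frac{h}{24}\big[27z_p + 17z_{p-1} + 5z_{p-2} - z_{p-3}\big].$$

Hence, by addition,

$$\int_{-h/2}^{(p+\frac12)h} z\,dx = \frac{h}{5760}\big[6463z_0 + 4371z_1 + 6669z_2 + 5537z_3 + 6463z_p + 4371z_{p-1} + 6669z_{p-2} + 5537z_{p-3}\big]$$
$$\qquad + h\big[z_4 + z_5 + z_6 + \ldots + z_{p-5} + z_{p-4}\big]$$
$$= h\big[1.1220(z_0+z_p) + 0.7588(z_1+z_{p-1}) + 1.1578(z_2+z_{p-2}) + 0.9613(z_3+z_{p-3}) + (z_4 + z_5 + \ldots + z_{p-4})\big].$$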

In effect, since $z = x^n y$, this means that to calculate the moments
from the given statistics we may work simply with the observed
ordinates or frequencies, as drawn up in Table (40), so long as we
modify the first four and the last four by multiplying them by
suitable factors. In particular, when the frequencies at the
beginning and end of the distribution are very small, that is to say,
when there is high contact at each end of the frequency curve,
we may dispense even with the modifying factors, since we
may assume that before the first and after the last ordinate observed
there are others which are so small as to be negligible.

Thus, given high contact at each extremity of the observation
curve, we may write

$$\int_{-h/2}^{(p+\frac12)h} z\,dx = h\,\Sigma z,$$

or, if we take the class interval as unit in measuring x so that h = 1,
this gives

$$\int y x^n\,dx = \Sigma f x^n,$$

where the integral may now be taken as referring to the fitted
curve, since the moments of the theoretical and of the observational
curves are to be equal, and the integration traverses the
extent of the curve. When, however, there is not high contact at
the extremities the same equation holds good if we multiply the
first and last of the observed f's by 1.1220, the second and the last
but one by 0.7588, the third and last but two by 1.1578, and the
fourth and last but three by 0.9613.
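
The end-corrected sum can be sketched as follows. This is an illustration under stated assumptions rather than the author's own procedure: the name `end_corrected_sum` is hypothetical, the factors are kept as the exact fractions 6463/5760, 4371/5760, 6669/5760 and 5537/5760 from which the decimals above are derived, and at least eight ordinates are assumed so that the two sets of four end factors do not overlap.

```python
from fractions import Fraction

# Exact end factors; to four decimals roughly 1.1220, 0.7588, 1.1578, 0.9613.
END_FACTORS = [Fraction(n, 5760) for n in (6463, 4371, 6669, 5537)]


def end_corrected_sum(z, h=1):
    """Approximate the integral of z from -h/2 to (p + 1/2)h using the
    mid-ordinates z[0..p], with the first and last four ordinates
    multiplied by the end factors and all others by unity."""
    if len(z) < 8:
        raise ValueError("need at least eight ordinates")
    weights = [Fraction(1)] * len(z)
    for i, w in enumerate(END_FACTORS):
        weights[i] = w          # first four ordinates
        weights[-1 - i] = w     # last four ordinates, in mirror order
    return h * sum(w * v for w, v in zip(weights, z))


# Check on z = x**3 at x = 0, 1, ..., 9 with h = 1.
z = [Fraction(x) ** 3 for x in range(10)]
approx = end_corrected_sum(z)
exact = (Fraction(19, 2) ** 4 - Fraction(-1, 2) ** 4) / 4
print(approx, exact)
```

Since formulae (2) and (3) are each exact for a cubic, the weighted sum agrees with the true area for this test curve; for frequency data it is only an approximation.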

In particular, when n = 0, integrating throughout the curve,

$$\int y\,dx = \Sigma f = N, \qquad (4)$$

which, being interpreted, means that the area contained between
the fitting curve and the axis of x measures the total frequency of
observations, modified if necessary.

Also, when the observation moments have been adjusted, if we
write $\mu'$ and $\mu$ in place of $\nu'$ and $\nu$ in the notation
previously proposed (see Table (40)), integrating again throughout
the curve,

$$\int xy\,dx \Big/ \int y\,dx = \Sigma xf/N = \mu'_1, \qquad (5)$$

and the geometrical interpretation of this is that the foot of the
ordinate passing through the centre of gravity of the area between
the fitting curve and the axis registers the deviation of the mean x
from the fixed origin.

If deviations are measured from the mean of the distribution
as origin $\Sigma(xf)$ vanishes (see also Appendix, Note (5)) so that
$\mu_1 = 0$. Generally, we have, with the same limits of integration,

$$\int x^n y\,dx \Big/ \int y\,dx = \Sigma x^n f/N = \mu'_n,$$

and when the distribution is referred to its mean as origin the
right-hand side is written $\mu_n$.

We now pass to the second case. 


(2) When the observations appear in groups ranging between
definite values of x, the range of each group as a rule being the same
in extent.

Since the usual procedure here is to treat each member of a group
as though it were centred at the x at the middle of that group
(e.g. a group of school girls each of some weight between 7 stone
and 7 stone 5 lbs. would be treated as if all its members were of
weight 7 stone 2.5 lbs.), this case evidently reduces to that already
considered. It is necessary, however, to examine what correction
must be made for assuming that all the members of the same group
have the same x.

Fig. (34).

Consider again the expression

$$\int x^n y\,dx.$$

The contribution to the nth moment coming from the rth group of
observations (see fig. (34)) may be taken as the portion of the
above integral between limits $x_0 + (r-\tfrac12)h$ and
$x_0 + (r+\tfrac12)h$, where $x_0$ is the distance of the centre of
the first group from the origin O.

But, since all the observations in the same group are treated as
if they had the same x, by (2) this integral may be written

$$\frac{h f_r}{5760}\Big[5178(x_0+rh)^n - 17\{(x_0+\overline{r-2}\,h)^n + (x_0+\overline{r+2}\,h)^n\} + 308\{(x_0+\overline{r-1}\,h)^n + (x_0+\overline{r+1}\,h)^n\}\Big],$$

where $f_r$ is the frequency of observations in the group, and this, on
expansion in powers of $(x_0+rh)$ and h,

$$= h f_r (x_0+rh)^n + \frac{h f_r}{5760}\Big[240\,n(n-1)h^2(x_0+rh)^{n-2} + 3n(n-1)(n-2)(n-3)h^4(x_0+rh)^{n-4} + \ldots\Big].$$

When we sum for all groups, the expression

$$\sum_{r=0}^{p} h f_r (x_0+rh)^n$$

gives evidently the nth moment of a set of isolated variables,
$f_0, f_1, f_2, \ldots, f_p$, and by Case (1) it may therefore be taken as
being practically equivalent to the required nth moment of the
observation curve, assuming that there is high contact at each end of
the curve.

The remaining terms,

$$\sum_{r=0}^{p} \frac{h f_r}{5760}\Big\{240\,n(n-1)h^2(x_0+rh)^{n-2} + 3n(n-1)(n-2)(n-3)h^4(x_0+rh)^{n-4} + \ldots\Big\},$$

may accordingly be taken as the correction required.

When n = 0, these terms vanish, so we infer, just as in Case (1),
that, when the integration is taken throughout the curve,

$$\int y\,dx = \Sigma f = N, \qquad (4)\ \text{bis},$$

or, the area between the fitting curve and the axis of x measures
the total frequency of observations when the class interval h is
treated as the unit in measuring x.

Again, when n = 1, the corrective terms vanish, so we likewise
infer, as in Case (1), that, with the same limits of integration,

$$\int xy\,dx \Big/ \int y\,dx = \Sigma xf/N = \mu'_1, \qquad (5)\ \text{bis},$$

and that $\mu_1 = 0$.

When n = 2, the reduction of the corrective terms gives

second unadjusted moment = second adjusted moment $+ \dfrac{h^3}{12}\Sigma f_r$,

or, dividing throughout by $\Sigma hf$ and bearing in mind the notation
adopted with the mean as origin,

$$\mu_2 = \nu_2 - \tfrac{1}{12}, \qquad (6)$$

when h = 1 as before.

When n = 3,

third unadjusted moment = third adjusted moment $+ \dfrac{h^3}{4}\Sigma f_r(x_0+rh)$;

but, if we refer the deviations to the mean of the distribution as
origin, $\Sigma f_r(x_0+rh)$ vanishes.

Therefore, $$\mu_3 = \nu_3. \qquad (7)$$

When n = 4,

fourth unadjusted moment
= fourth adjusted moment $+ \dfrac{h^3}{2}\Sigma f_r(x_0+rh)^2 + \dfrac{h^5}{80}\Sigma f_r$.

Hence, dividing through as before by $\Sigma hf$ and taking h as 1,

$$\nu_4 = \mu_4 + \tfrac12\mu_2 + \tfrac{1}{80}.$$

Therefore, $$\mu_4 = \nu_4 - \tfrac12\nu_2 + \tfrac{7}{240}. \qquad (8)$$

To sum up, the general procedure in Case (2) is to calculate
$N, N\nu'_1, N\nu'_2, N\nu'_3, N\nu'_4$ directly from the statistics and
so deduce $\nu'_1, \nu'_2, \nu'_3, \nu'_4$. Then, transferring the origin
to the mean, the $\nu'$'s become $\nu_1, \nu_2, \nu_3, \nu_4$ (see
Appendix, Note 5), and finally the corrected $\mu$'s are given by

$$\mu_2 = \nu_2 - \tfrac{1}{12}, \qquad \mu_3 = \nu_3, \qquad \mu_4 = \nu_4 - \tfrac12\nu_2 + \tfrac{7}{240}.$$
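
As a small illustration of the last step of this procedure, the sketch below applies the adjustments of equations (6), (7) and (8) with the class interval as unit. The function name `sheppard_adjust` is hypothetical, and the trial figures are the unadjusted moments of the examination marks worked out in Example (1) of this chapter.

```python
def sheppard_adjust(nu2, nu3, nu4):
    """Adjusted moments about the mean, class interval taken as unit
    (h = 1); valid only when there is high contact at both ends."""
    mu2 = nu2 - 1 / 12
    mu3 = nu3
    mu4 = nu4 - nu2 / 2 + 7 / 240
    return mu2, mu3, mu4


# Unadjusted moments of the examination marks (Example (1)).
mu2, mu3, mu4 = sheppard_adjust(5.42891, 0.29296, 78.79964)
print(round(mu2, 5), round(mu3, 5), round(mu4, 4))
```

The results should agree, to within rounding in the last figure, with the adjusted values 5.34558, 0.29296 and 76.11436 quoted in that example.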

These adjustments, originally due to Dr. W. F. Sheppard * [Proceedings
of the Lond. Mathl. Socy., vol. xxix., pp. 353 et seq.], are
applicable only when the curve of distribution has high contact at
each extremity, as very frequently happens. To this case we shall
confine ourselves, and when it does not hold the unadjusted moments
may be used as a rough approximation failing a more refined but
also more intricate adjustment.

[* To obtain Sheppard's adjustments we have followed the method indicated
in Elderton's Frequency Curves and Correlation, pp. 28, 29.]

The way in which the three chief kinds of average are related to
the fitting curve is of interest and deserves recapitulation. Whether
the observations are classed as in Case (1) or as in Case (2):

(1) the ordinate drawn through the highest point of the curve,
    since the frequency there is a maximum, fixes the modal
    value of x;

(2) the median x is determined by the ordinate bisecting the
    area between the curve and axis, since there are an equal
    number of observations on either side of it; and

(3) the mean is determined by the ordinate through the centre
    of gravity of the area between the curve and axis.


We have still to show how to express the constants of the fitting
curve in terms of the moments calculated from the given statistics, and
it will be convenient now to make our approach from the other end.

Take the general equation of the fitting curve, express its constants
in terms of its moments, and substitute for the latter the
values determined from the statistics, since the basis of the fitting
is the equalization of the moments of the observational curve and
of the theoretical curve. This will enable us to determine $\kappa$, the
criterion for fixing the type of curve suitable to the given
distribution. When the type has been fixed it is, as a rule, not a very
difficult matter to express the constants of the particular type
again in terms of the observational moments.

Now the general differential equation of the fitting curve was

$$\frac{1}{y}\frac{dy}{dx} = \frac{x+b}{px^2+qx+r},$$

hence

$$\int (px^2+qx+r)\,dy = \int y(x+b)\,dx,$$

where the integration is to traverse the complete curve.
Therefore, multiplying both sides by $x^n$,

$$\int (px^{n+2} + qx^{n+1} + rx^n)\,dy = \int (yx^{n+1} + byx^n)\,dx;$$

or, if we integrate the left-hand side by parts,

$$\big[(px^{n+2} + qx^{n+1} + rx^n)\,y\big] - \int y\big(\overline{n+2}\,px^{n+1} + \overline{n+1}\,qx^n + nrx^{n-1}\big)\,dx = \int (yx^{n+1} + byx^n)\,dx.$$

But the expression in square brackets vanishes at both limits if
we suppose y to be zero at each end of the curve, so that the
equation reduces to

$$(1 + p\,\overline{n+2})\int yx^{n+1}\,dx + (b + q\,\overline{n+1})\int yx^n\,dx + nr\int yx^{n-1}\,dx = 0. \qquad (9)$$

Now if deviations are measured from the mean of the distribution,
we have

$$\int yx\,dx = N\mu_1 = 0, \qquad \int yx^3\,dx = N\mu_3, \quad \text{etc.},$$

and therefore, putting n = 3 in the above relation,

$$(1+5p)N\mu_4 + (b+4q)N\mu_3 + 3rN\mu_2 = 0;$$

put n = 2, $$(1+4p)N\mu_3 + (b+3q)N\mu_2 = 0;$$

put n = 1, $$(1+3p)N\mu_2 + rN = 0;$$

put n = 0, $$(b+q)N = 0.$$

Thus $b = -q$, and, on substitution in the other three equations, we get

$$5\mu_4 p + 3\mu_3 q + 3\mu_2 r + \mu_4 = 0,$$
$$4\mu_3 p + 2\mu_2 q + \mu_3 = 0,$$
$$3\mu_2 p + r + \mu_2 = 0,$$

three simple linear equations to find p, q, r, the solution of which
leads to

$$p = -(2\mu_2\mu_4 - 3\mu_3^2 - 6\mu_2^3)/(10\mu_2\mu_4 - 18\mu_2^3 - 12\mu_3^2),$$
$$q = -b = -\mu_3(\mu_4 + 3\mu_2^2)/(10\mu_2\mu_4 - 18\mu_2^3 - 12\mu_3^2),$$
$$r = -\mu_2(4\mu_2\mu_4 - 3\mu_3^2)/(10\mu_2\mu_4 - 18\mu_2^3 - 12\mu_3^2).$$

We have thus expressed p, q, r, and b, the constants of the fitting
curve, in terms of the moments of the observed distribution, but the
results may be rendered more concise by writing

$$\beta_1 = \mu_3^2/\mu_2^3, \qquad \beta_2 = \mu_4/\mu_2^2, \qquad (10)$$

whence

$$p = -(2\beta_2 - 3\beta_1 - 6)/2(5\beta_2 - 6\beta_1 - 9), \qquad (11)$$
$$q = -b = -\sqrt{\mu_2\beta_1}\,(\beta_2 + 3)/2(5\beta_2 - 6\beta_1 - 9), \qquad (12)$$
$$r = -\mu_2(4\beta_2 - 3\beta_1)/2(5\beta_2 - 6\beta_1 - 9). \qquad (13)$$

And $\kappa$, the criterion for fixing the type of curve suitable to the
statistics given, is immediately deduced from

$$\kappa = q^2/4pr = \beta_1(\beta_2+3)^2/4(4\beta_2 - 3\beta_1)(2\beta_2 - 3\beta_1 - 6). \qquad (14)$$

Also, since $\dfrac{dy}{dx}$ vanishes when $x = -b$, this fixes the mode
relative to the origin. But the origin is now at the mean, so that

$$\text{mode} - \text{mean} = -b = -\sqrt{\mu_2\beta_1}\,(\beta_2+3)/2(5\beta_2 - 6\beta_1 - 9). \qquad (15)$$

And

$$\text{skewness} = (\text{mean} - \text{mode})/\text{S.D.} = b/\sqrt{\mu_2} = \sqrt{\beta_1}\,(\beta_2+3)/2(5\beta_2 - 6\beta_1 - 9). \qquad (16)$$
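
Equations (10) to (16) translate directly into a short routine. The sketch below is illustrative only: the name `pearson_constants` is hypothetical, and one assumption is made explicit in the code, namely that $\sqrt{\mu_2\beta_1}$ is given the sign of $\mu_3$ (it equals $\mu_3/\mu_2$ in magnitude), which is what the solution of the three linear equations requires.

```python
import math


def pearson_constants(mu2, mu3, mu4):
    """Constants of the fitting curve from the moments about the mean,
    following equations (10) to (16)."""
    beta1 = mu3 ** 2 / mu2 ** 3
    beta2 = mu4 / mu2 ** 2
    denom = 2 * (5 * beta2 - 6 * beta1 - 9)
    p = -(2 * beta2 - 3 * beta1 - 6) / denom
    # Assumption: the square root carries the sign of mu3.
    root = math.copysign(math.sqrt(mu2 * beta1), mu3)
    q = -root * (beta2 + 3) / denom
    r = -mu2 * (4 * beta2 - 3 * beta1) / denom
    kappa = beta1 * (beta2 + 3) ** 2 / (
        4 * (4 * beta2 - 3 * beta1) * (2 * beta2 - 3 * beta1 - 6))
    return {"beta1": beta1, "beta2": beta2, "p": p, "q": q, "b": -q,
            "r": r, "kappa": kappa, "skewness": -q / math.sqrt(mu2)}


# Adjusted moments of the examination marks (Example (1)).
c = pearson_constants(5.34558, 0.29296, 76.11436)
print(round(c["beta2"], 5), round(c["kappa"], 5))
```

With these moments $\beta_2$ comes out close to 2.66365 and $\kappa$ close to -0.0006, the values used in Example (1) to justify a normal curve.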
APPLICATIONS OF CURVE FITTING

We are in a position now to test the application of these principles
to given frequency distributions and we shall start by trying to
find a curve to fit the record of marks obtained by 514 candidates
in a certain examination (see p. 25).

Example (1). This example is chosen because it turns out,
when we come to evaluate $\kappa$, that it is well fitted by the normal
curve, Type VII, which is one of the simplest and at the same time
the most important of all the types discussed. Before we start
the numerical part of the work it will be well to express the
constants $y_0$ and $\sigma$ of this curve in terms of the moments of the
distribution.

The equation of the normal curve is

$$y = y_0\,e^{-x^2/2\sigma^2}.$$

If N be the total frequency, we have by equation (4) p. 202,

$$N = \int_{-\infty}^{+\infty} y_0\,e^{-x^2/2\sigma^2}\,dx.$$

Put $x^2/2\sigma^2 = \xi^2$, so that $dx = \sigma\sqrt2\,d\xi$, and when
$x = \infty$, $\xi = \infty$ also.

Thus

$$N = y_0\sigma\sqrt2\int_{-\infty}^{+\infty} e^{-\xi^2}\,d\xi = y_0\sigma\sqrt2\,\sqrt\pi \quad \text{(see Appendix, Note 8)}$$
$$= \sqrt{(2\pi)}\,\sigma y_0. \qquad (1)$$

Again

$$\mu_2 = \frac{1}{N}\int_{-\infty}^{+\infty} x^2 y\,dx = \frac{y_0}{N}\cdot\sigma\sqrt2\cdot 2\sigma^2\cdot 2\int_0^\infty \xi^2 e^{-\xi^2}\,d\xi$$
$$= \frac{2\sqrt2\,\sigma^3 y_0}{N}\Big\{\big[-\xi e^{-\xi^2}\big]_0^\infty + \int_0^\infty e^{-\xi^2}\,d\xi\Big\}$$
$$= \frac{2\sqrt2\,\sigma^3 y_0}{N}\cdot\frac{\sqrt\pi}{2},$$

since $\xi e^{-\xi^2}$ vanishes at both limits.

Therefore, $\mu_2 = \sqrt2\,\sigma y_0\sqrt\pi\cdot\sigma^2/N = \sigma^2$, by (1).

In fact, $\sigma$ is simply the S.D. of the distribution.

And $y_0 = N/\sqrt{(2\pi)}\,\sigma$.


Table (41). Distribution of Marks obtained by 514 Candidates
in a Certain Examination.

Mean No.   Deviation   Frequency of   First     Second    Third     Fourth
of Marks   from 33     Candidates     Moment    Moment    Moment    Moment
           (x)         (f)            (fx)      (fx^2)    (fx^3)    (fx^4)

    3         -6            5           -30       180      -1080      6480
    8         -5            9           -45       225      -1125      5625
   13         -4           28          -112       448      -1792      7168
   18         -3           49          -147       441      -1323      3969
   23         -2           58          -116       232       -464       928
   28         -1           82           -82        82        -82        82
   33         ..           87            ..        ..         ..        ..
   38         +1           79           +79        79        +79        79
   43         +2           50          +100       200       +400       800
   48         +3           37          +111       333       +999      2997
   53         +4           21           +84       336      +1344      5376
   58         +5            6           +30       150       +750      3750
   63         +6            3           +18       108       +648      3888

 Total        ..          514          -110      2814      -1646    41,142

The first 4 moments referred to 33 as origin and with the class
interval, 5 marks, as unit of deviation, are

-110/514, 2814/514, -1646/514, 41142/514.

The arithmetic mean of the distribution

$$= 33 + 5\bar x = 33 + 5(-\tfrac{110}{514}) = 33 - 5(0.214008) = 31.92996.$$

The second, third, and fourth moments referred to the mean as
origin, and retaining five marks as unit of deviation, are given
(see Appendix, Note 5) by

$$\nu_2 = 2814/514 - \bar x^2 = 5.42891$$
$$\nu_3 = -1646/514 - 3\bar x\nu_2 - \bar x^3 = 0.29296$$
$$\nu_4 = 41142/514 - 4\bar x\nu_3 - 6\bar x^2\nu_2 - \bar x^4 = 78.79964.$$

After making Sheppard's adjustments

$$\mu_2 = \nu_2 - \tfrac{1}{12}, \qquad \mu_3 = \nu_3, \qquad \mu_4 = \nu_4 - \tfrac12\nu_2 + \tfrac{7}{240},$$

these become

$$\mu_2 = 5.34558, \qquad \mu_3 = 0.29296, \qquad \mu_4 = 76.11436.$$

Thus

$$\beta_1 = \mu_3^2/\mu_2^3 = 0.00056, \qquad \beta_2 = \mu_4/\mu_2^2 = 2.66365.$$

Hence

$$\kappa = \beta_1(\beta_2+3)^2/4(4\beta_2 - 3\beta_1)(2\beta_2 - 3\beta_1 - 6)$$
$$= (0.00056)(5.66365)^2/4(10.65292)(-0.67438)$$
$$= -0.00063.$$

Since $\kappa$ and $\beta_1$ are small and $\beta_2$ does not differ greatly
from 3, making p and q small, we may fit a normal curve to this
distribution.
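
The arithmetic of Table (41) can be reproduced in a few lines. This is a sketch for checking purposes only; the helper `raw_moment` is a hypothetical name, and the frequencies are those of the table with deviations measured from 33 in units of 5 marks.

```python
freqs = [5, 9, 28, 49, 58, 82, 87, 79, 50, 37, 21, 6, 3]
devs = list(range(-6, 7))   # deviations from 33, unit = 5 marks
total = sum(freqs)


def raw_moment(k):
    """k-th moment about the working origin (33 marks)."""
    return sum(f * x ** k for f, x in zip(freqs, devs)) / total


xbar = raw_moment(1)
mean_marks = 33 + 5 * xbar

# Transfer to the mean as origin (unadjusted moments).
nu2 = raw_moment(2) - xbar ** 2
nu3 = raw_moment(3) - 3 * xbar * nu2 - xbar ** 3
nu4 = raw_moment(4) - 4 * xbar * nu3 - 6 * xbar ** 2 * nu2 - xbar ** 4

print(total, round(mean_marks, 5), round(nu2, 5), round(nu3, 5), round(nu4, 5))
```

The figures printed should match the mean 31.92996 and the unadjusted moments 5.42891, 0.29296 and 78.79964 to within a unit or so in the last decimal place.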

The appropriate normal curve is

$$y = y_0\,e^{-x^2/2\sigma^2},$$

where $\sigma^2 = \mu_2 = 5.34558$ (5 marks as unit),

$$y_0 = N/\sqrt{(2\pi\mu_2)} = 514/\sqrt{2\pi(5.34558)} = 88.6903.$$

Hence the required curve has for its equation, writing results to
three significant figures,

$$y = 88.7\,e^{-x^2/10.7}.$$

Now the mean of the distribution is at 31.92996, where the
central ordinate of the normal curve is erected, and the distance
of any x, say $x_{33}$, from this point

= (33 - 31.92996)/5 (expressed with 5 marks as unit)
= 0.214008.

Any other x may be found in the same way and y can then be
deduced from the equation of the curve by taking logs, thus

$$\log_{10} y = \log_{10} 88.6903 - \frac{x^2}{2(5.34558)}\log_{10} e$$
$$= 1.9478762 - (0.0406218)x^2.$$

This enables us to calculate the ordinates of the normal curve and
thence we could evaluate the areas by successive applications of a
suitable quadrature formula.

We can, however, get the areas direct by using a table of the
probability integral, such as that due to Dr. W. F. Sheppard (see
pp. 284, 285). In that case the corresponding abscissae have first
to be expressed in terms of the standard deviation as unit, e.g.

$$x_{40.5} = 40.5 - 31.92996 = 8.57004,$$
and $$\sigma = 5\sqrt{(5.34558)} = 11.56025,$$

where the factor 5 is introduced because 5 marks was the unit in
the calculation of $\mu_2$ (a process equivalent in effect to that
previously adopted).

Thus $x_{40.5}/\sigma = 0.741336 = \xi$, say.

The area of the normal curve up to the abscissa $x/\sigma$ or $\xi$

$$= N\cdot\tfrac12(1+\alpha),$$

where $\alpha/2$ represents the area of the curve between 0 and $\xi$
(the whole area being taken as unity).

Sheppard's Tables give the values of $\tfrac12(1+\alpha)$ for different
values of $\xi$, and when

$\xi = 0.74$, $\tfrac12(1+\alpha) = 0.7703500$
$\xi = 0.75$, $\tfrac12(1+\alpha) = 0.7733726$.

Therefore, by interpolation, when

$\xi = 0.741336$, $\tfrac12(1+\alpha) = 0.7707538$.

Thus the frequency of candidates with marks lying between 0 and
40.5

= 514(0.7707538) = 396.17.

Similarly the frequency of candidates with marks lying between
0 and 45.5 = 452.20.

Fig. (35).

Hence the normal frequency for the group with 43 as mean
number of marks = 56.0, and the same method gives the area for
any other group.
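
Where a table of the probability integral is not to hand, the same areas can be obtained from the error function. The sketch below is an illustration, not the method of the text: it replaces Sheppard's Tables by `math.erf` from the Python standard library, and the names `normal_cdf` and `frequency_below` are hypothetical. The mean and standard deviation are those found above.

```python
import math

N, MEAN, SIGMA = 514, 31.92996, 11.56025


def normal_cdf(xi):
    """Area of the unit normal curve up to xi, i.e. (1 + alpha) / 2."""
    return 0.5 * (1.0 + math.erf(xi / math.sqrt(2.0)))


def frequency_below(marks):
    """Normal frequency of candidates with fewer than `marks` marks."""
    return N * normal_cdf((marks - MEAN) / SIGMA)


below_40_5 = frequency_below(40.5)
below_45_5 = frequency_below(45.5)
group_43 = below_45_5 - below_40_5   # group with 43 as mean number of marks
print(round(below_40_5, 2), round(below_45_5, 2), round(group_43, 1))
```

The three figures should be close to 396.17, 452.20 and 56.0; small differences in the last place arise because the text interpolates linearly in the table.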

The histogram of the observations and the curve plotted from the 
ordinates are shown together in fig. (35). 

In Table (42) are set out the calculated normal frequency (col. (4))
for each group alongside the corresponding observed frequency
(col. (2)), and the differences between the two are shown in col. (5).
We want to know whether the fit is a good one.

Table (42). Comparison of Observed and Normal
Frequencies in Examination Example.

  (1)        (2)          (3)         (4)        (5)         (6)          (7)
Mean No.   Observed     Normal Frequency.     Deviation.   Sq. of      Ratio of No.
of Marks.  Frequency.   Ordinates.   Areas.                Deviation.  in Col. (6) to
                                                                       No. in Col. (4).

    3          5           3.9         5.7      +0.7         0.49         0.09
    8          9          10.4        10.7      +1.7         2.89         0.27
   13         28          23.2        23.5      -4.5        20.25         0.86
   18         49          42.9        43.1      -5.9        34.81         0.81
   23         58          65.8        65.6      +7.6        57.76         0.88
   28         82          83.7        83.1      +1.1         1.21         0.01
   33         87          88.3        87.6      +0.6         0.36         0.00
   38         79          77.3        76.8      -2.2         4.84         0.06
   43         50          56.1        56.0      +6.0        36.00         0.64
   48         37          33.7        34.0      -3.0         9.00         0.26
   53         21          16.8        17.1      -3.9        15.21         0.89
   58          6           7.0         7.2      +1.2         1.44         0.20
   63          3       [illegible]     3.5      +0.5         0.25         0.07

   ..        514         511.6       513.9       ..        184.51     χ² = 5.04

Now with this object we might square each difference as in
col. (6), sum the squares, and find the mean square deviation by
dividing by the total frequency; this, after extracting the square
root, would give what might be called the root-mean-square error,
regarding the theoretical values as the true ones. In the above
example it

$$= \sqrt{(184.51/514)} = 0.599.$$

But this form of result, while it may be useful in some cases,
e.g. in comparing two distributions of the same kind to some
theoretical series, is open to objection; for one thing it treats all
the differences as if they were of equal importance in absolute
magnitude, but a difference of 2, say, in a normal frequency of 10
is clearly more serious than a like difference in a frequency of 60.
The objection, however, goes deeper than that; even when the
root-mean-square deviation is found we are at a loss to estimate
its precise relationship to the quality of fit, as there seems to be no
definite connection between one distribution and another of a
different kind: there is no standard case, so to speak, to which we
can always appeal, where the fit is agreed to be good and supplying
therefore a suitable root-mean-square deviation for comparison.

This leads us to the question: What constitutes goodness of
fit? Suppose by some means we have selected a theoretical or
empirical formula to describe a certain frequency distribution in a
given population; if the frequency values observed do not differ
from the theoretical frequencies by more than the deviations we
might expect owing to random sampling, then clearly the fit may be
regarded as a good one. And we have a measure of the fit if we
can find the proportion of random samples, of the same size as the
given distribution, showing greater deviations from the distribution
given by theory than those which are actually observed.

Now Professor Karl Pearson has shown how this proportion can
be calculated [Phil. Mag., vol. l., pp. 157-175 (1900)]; he finds the
probability that a random sample should give a frequency distribution
differing from that which theory proposes by as much as or by
more than the distribution actually observed. This probability, P,
is a function of $\chi^2$, where

$$\chi^2 = \Sigma\{(y - y')^2/y\},$$

y and y' representing the theoretical and observed frequencies for
any particular group and the summation is to include all groups.
It will be noted that this expression gives each difference (y - y')
its appropriate importance by relating it to the frequency y of its
own group.

A table in Biometrika (vol. i., pp. 155 et seq.) gives the values of P
corresponding to different values of $\chi^2$ (including all integral values
from 1 to 30) and to values of n', the total number of frequency
groups, from 3 to 30 (see also p. 285). The mathematics involved
in finding P is difficult, and the reader who wishes to enter
into it must consult the original memoir, but the utility of the
function has been proved by experience and it is readily applied
in a particular case.

In the above example $\chi^2$ is found from col. (7): it equals 5.04,
and from the table of values of P, when n' = 13, we have

P = 0.957979 when $\chi^2 = 5$,
and P = 0.916082 when $\chi^2 = 6$.

Therefore, by proportional interpolation, when $\chi^2 = 5.04$,
P = 0.956303. Thus, supposing our data to follow the normal curve,
in 956 random samples out of 1000 we should expect to get a
worse-fitting distribution than that given by the sample actually
observed. We may therefore conclude without hesitation that
the normal curve provides an excellent fit in this particular instance.
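
The test can also be carried through without the printed table. The following is a sketch under stated assumptions: P is taken as the upper tail of the $\chi^2$ distribution with n' - 1 degrees of freedom (the convention the tabulated values above appear to follow), evaluated by the standard finite series for whole numbers of degrees of freedom; the function names are hypothetical, and the theoretical frequencies are the areas of col. (4) of Table (42).

```python
import math


def chi_square(observed, theoretical):
    """Sum of (y - y')**2 / y over all groups, y being theoretical."""
    return sum((t - o) ** 2 / t for o, t in zip(observed, theoretical))


def chi_square_p(chi2, groups):
    """P for n' = groups, taken as the chi-square tail with n' - 1
    degrees of freedom (assumed convention), via the finite series."""
    dof = groups - 1
    if dof < 1:
        raise ValueError("need at least two groups")
    half = chi2 / 2.0
    if dof % 2 == 0:
        term, total = 1.0, 1.0
        for k in range(1, dof // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    chi = math.sqrt(chi2)
    term, total = chi, 0.0
    for k in range(1, (dof - 1) // 2 + 1):
        if k > 1:
            term *= chi2 / (2 * k - 1)
        total += term
    return (math.erfc(chi / math.sqrt(2.0))
            + math.sqrt(2.0 / math.pi) * math.exp(-half) * total)


observed = [5, 9, 28, 49, 58, 82, 87, 79, 50, 37, 21, 6, 3]
areas = [5.7, 10.7, 23.5, 43.1, 65.6, 83.1, 87.6,
         76.8, 56.0, 34.0, 17.1, 7.2, 3.5]
chi2 = chi_square(observed, areas)
p = chi_square_p(chi2, len(observed))
print(round(chi2, 2), round(p, 3))
```

Working from the unrounded ratios gives a $\chi^2$ slightly above the 5.04 of col. (7), and a P within a few units in the fourth decimal place of the interpolated value 0.956303.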
We pass on now to fresh distributions to illustrate some of the
other types of frequency curve.

Example (2) deals with the percentage of trade union members
unemployed at the end of each month for the years 1898 to 1912
[data from the Sixteenth Abstract of Labour Statistics of the United
Kingdom, Cd. 7131]. Table (43) shows the distribution of the
180 records according to the percentage unemployed.

The deviations are measured from the centre of the group (3.9-5.2)
as origin, and the class interval (1.3 per cent.) is taken as unit of
deviation as usual.

The first four moments are:

$-29/180\ (= \bar x)$, 425/180, 397/180, 3053/180;
i.e. -0.1611111, 2.3611111, 2.2055556, 16.9611111.

Table (43). Distribution of Unemployed Percentages
of Trade Union Members.

Percentage     Deviation.   Frequency.   First     Second    Third     Fourth
Unemployed                               Moment.   Moment.   Moment.   Moment.
(lower limit
of class).

    0             -3            0           0         0         0         0
    1.3           -2           33         -66       132      -264       528
    2.6           -1           57         -57        57       -57        57
    3.9           ..           41          ..        ..        ..        ..
    5.2           +1           24         +24        24       +24        24
    6.5           +2           10         +20        40       +80       160
    7.8           +3           11         +33        99      +297       891
    9.1           +4            3         +12        48      +192       768
   10.4           +5            1          +5        25      +125       625

   ..             ..          180         -29       425      +397      3053

Referred to the mean,

$$4.55 + 1.3\bar x = 4.3405556,$$

the second, third, and fourth moments are (see Appendix, Note 5),

$$\nu_2 = 2.3611111 - \bar x^2 = 2.3351543,$$
$$\nu_3 = 2.2055556 - 3\bar x\nu_2 - \bar x^3 = 3.338395,$$
$$\nu_4 = 16.9611111 - 4\bar x\nu_3 - 6\bar x^2\nu_2 - \bar x^4 = 18.74817.$$

Owing to the very doubtful contact at the beginning of the curve
Sheppard's adjustments were not made in this case, but the rough
moments as calculated above were used.

Thus $$\beta_1 = \nu_3^2/\nu_2^3 = 0.875242,$$
$$\beta_2 = \nu_4/\nu_2^2 = 3.43817,$$
and $$\kappa = \beta_1(\beta_2+3)^2/4(4\beta_2 - 3\beta_1)(2\beta_2 - 3\beta_1 - 6) = -0.466.$$

Since $\kappa$ is negative the fitting curve should be of Type I., the
equation of which is

$$y = y_0\Big(1 + \frac{x}{a_1}\Big)^{m_1}\Big(1 - \frac{x}{a_2}\Big)^{m_2},$$

where $m_1/a_1 = m_2/a_2$, and $(a_1 + a_2) = b$, say.

It is therefore necessary before going further to determine $a_1$,
$a_2$, b, $m_1$ and $m_2$ in terms of $\nu_2, \nu_3, \nu_4$, or $\beta_1$ and
$\beta_2$, the constants of the distribution.

The value of $y_0$ is found to be most conveniently expressed as a
Gamma function which is defined, with the usual notation, thus:

$$\Gamma(k) = \int_0^\infty e^{-x}x^{k-1}\,dx,$$

whence it follows that $\Gamma(k+1) = k\Gamma(k)$. [See Appendix, Note 9,
also p. 285.]

Also, if

$$B(m, n) = \int_0^1 x^{m-1}(1-x)^{n-1}\,dx,$$

it may be easily shown that

$$B(m, n) = \Gamma(m)\Gamma(n)/\Gamma(m+n).$$ [See Appendix, Note 9.]

The general method of procedure in determining the constants
for all the different types is:

1. Express the fact that the area of the curve is a measure of
   the total frequency of the distribution; this enables us to
   find $y_0$.

2. Find the nth moment of the curve with regard to some fixed
   origin; giving n particular values, 1, 2, 3, 4, this leads to
   the determination of $\mu_2, \mu_3, \beta_1, \beta_2$ in terms of the
   constants of the curve, and thence to formulae for calculating
   the constants.

Once found, the same formulae may be used, of course, in all
cases of the same type: we have only to replace letters by the
numbers for which they stand.

Applying this method to the Type I. curve, we have

$$N = \int_{-a_1}^{+a_2} y\,dx = \frac{y_0}{a_1^{m_1}a_2^{m_2}}\int_{-a_1}^{+a_2}(a_1+x)^{m_1}(a_2-x)^{m_2}\,dx.$$

Put $(a_1+x) = (a_1+a_2)z$, so that $(a_2-x) = (a_1+a_2)(1-z)$ and
$\dfrac{dx}{dz} = (a_1+a_2) = b$; therefore

$$N = \frac{y_0\,b\,(a_1+a_2)^{m_1+m_2}}{a_1^{m_1}a_2^{m_2}}\int_0^1 z^{m_1}(1-z)^{m_2}\,dz. \qquad (2)$$

Hence, since $m_1/a_1 = m_2/a_2 = (m_1+m_2)/b$,

$$y_0 = \frac{N}{b}\cdot\frac{m_1^{m_1}m_2^{m_2}}{(m_1+m_2)^{m_1+m_2}}\cdot\frac{\Gamma(m_1+m_2+2)}{\Gamma(m_1+1)\Gamma(m_2+1)}.$$

Again,

$$N\mu'_n = \int_{-a_1}^{+a_2} y\,(a_1+x)^n\,dx$$

is the nth moment of the distribution referred to $(-a_1, 0)$, the
point where the curve starts from the axis on the left-hand side,
as origin.

Therefore, as above,

$$N\mu'_n = \frac{y_0}{a_1^{m_1}a_2^{m_2}}\int_{-a_1}^{+a_2}(a_1+x)^{m_1+n}(a_2-x)^{m_2}\,dx$$
$$= \frac{b\,y_0\,(a_1+a_2)^{m_1+m_2+n}}{a_1^{m_1}a_2^{m_2}}\int_0^1 z^{m_1+n}(1-z)^{m_2}\,dz$$
$$= b^n N\int_0^1 z^{m_1+n}(1-z)^{m_2}\,dz \Big/ \int_0^1 z^{m_1}(1-z)^{m_2}\,dz, \quad \text{by (2)}.$$

Hence,

$$\mu'_n = b^n\,\Gamma(m_1+n+1)\Gamma(m_1+m_2+2)/\Gamma(m_1+1)\Gamma(m_1+m_2+n+2)$$
$$= b^n (m_1+n)(m_1+n-1)\ldots(m_1+1)/(m_1+m_2+n+1)(m_1+m_2+n)\ldots(m_1+m_2+2),$$

by repeated application of the relation $\Gamma(k+1) = k\Gamma(k)$.

Putting n = 1, 2, 3, 4 in succession, we have

$$\mu'_1 = b(m_1+1)/(m_1+m_2+2),$$
$$\mu'_2 = b^2(m_1+2)(m_1+1)/(m_1+m_2+3)(m_1+m_2+2),$$
$$\mu'_3 = b^3(m_1+3)(m_1+2)(m_1+1)/(m_1+m_2+4)(m_1+m_2+3)(m_1+m_2+2),$$
$$\mu'_4 = b^4(m_1+4)(m_1+3)(m_1+2)(m_1+1)/(m_1+m_2+5)(m_1+m_2+4)(m_1+m_2+3)(m_1+m_2+2).$$

These relations are rendered more concise if we write

$$m_1 + 1 = m'_1, \qquad m_2 + 1 = m'_2, \qquad m_1 + m_2 + 2 = r;$$

thus
$$\mu'_1 = bm'_1/r,$$
$$\mu'_2 = b^2m'_1(m'_1+1)/r(r+1),$$
$$\mu'_3 = b^3m'_1(m'_1+1)(m'_1+2)/r(r+1)(r+2),$$
$$\mu'_4 = b^4m'_1(m'_1+1)(m'_1+2)(m'_1+3)/r(r+1)(r+2)(r+3).$$

To get the corresponding moments referred to the mean as
origin we have the relations:

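$$\mu_2 = \mu'_2 - \mu'^2_1, \qquad \mu_3 = \mu'_3 - 3\mu'_1\mu'_2 + 2\mu'^3_1, \qquad \mu_4 = \mu'_4 - 4\mu'_1\mu'_3 + 6\mu'^2_1\mu'_2 - 3\mu'^4_1,$$

which, after some straightforward reduction, give

$$\mu_2 = b^2m'_1m'_2/r^2(r+1),$$
$$\mu_3 = 2b^3m'_1m'_2(m'_2 - m'_1)/r^3(r+1)(r+2),$$
$$\mu_4 = 3b^4m'_1m'_2[m'_1m'_2(r-6) + 2r^2]/r^4(r+1)(r+2)(r+3).$$

Thus

$$\beta_1 = \mu_3^2/\mu_2^3 = \frac{4b^6m'^2_1m'^2_2(m'_2-m'_1)^2}{r^6(r+1)^2(r+2)^2}\Big/\frac{b^6m'^3_1m'^3_2}{r^6(r+1)^3}$$
$$= 4(m'_2 - m'_1)^2(r+1)/m'_1m'_2(r+2)^2$$
$$= 4(r^2 - 4m'_1m'_2)(r+1)/m'_1m'_2(r+2)^2.$$

Therefore,

$$\frac{r^2}{m'_1m'_2} = \frac{\beta_1(r+2)^2}{4(r+1)} + 4. \qquad (3)$$

Again,

$$\beta_2 = \mu_4/\mu_2^2 = \frac{3b^4m'_1m'_2[m'_1m'_2(r-6)+2r^2]}{r^4(r+1)(r+2)(r+3)}\Big/\frac{b^4m'^2_1m'^2_2}{r^4(r+1)^2}$$
$$= \frac{3[m'_1m'_2(r-6) + 2r^2](r+1)}{m'_1m'_2(r+2)(r+3)}.$$

Therefore,

$$\frac{2r^2}{m'_1m'_2} = -r + 6 + \frac{\beta_2(r+2)(r+3)}{3(r+1)}. \qquad (4)$$

Combining (3) and (4),

$$2\Big\{\frac{\beta_1(r+2)^2}{4(r+1)} + 4\Big\} = -r + 6 + \frac{\beta_2(r+2)(r+3)}{3(r+1)},$$

whence $$r = 6(\beta_2 - \beta_1 - 1)/(3\beta_1 - 2\beta_2 + 6). \qquad (5)$$

Again, since $\mu_2 = b^2m'_1m'_2/r^2(r+1)$,
therefore $b^2 = \mu_2(r+1)\cdot[\beta_1(r+2)^2 + 16(r+1)]/4(r+1)$, by (3),

$$b = \tfrac12\sqrt{\mu_2}\,\sqrt{[\beta_1(r+2)^2 + 16(r+1)]}. \qquad (6)$$

And $m'_1m'_2 = 4r^2(r+1)/[\beta_1(r+2)^2 + 16(r+1)]$,
while $m'_1 + m'_2 = r$; hence $m'_1$ and $m'_2$ are roots of

$$m^2 - rm + \frac{4r^2(r+1)}{\beta_1(r+2)^2 + 16(r+1)} = 0,$$

i.e. $$m = \frac{r}{2}\Big[1 \pm \sqrt{\Big\{1 - \frac{16(r+1)}{\beta_1(r+2)^2 + 16(r+1)}\Big\}}\Big];$$

therefore, $m_1$ and $m_2$ (each being less by 1 than the corresponding
root) are respectively equal to

$$\tfrac12\Big[r - 2 \pm r(r+2)\sqrt{\frac{\beta_1}{\beta_1(r+2)^2 + 16(r+1)}}\Big], \qquad (7)$$

and $a_1$ and $a_2$ follow from

$$\frac{m_1}{a_1} = \frac{m_2}{a_2} = \frac{m_1+m_2}{b}. \qquad (8)$$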

Applying these formulae to the 'unemployed' example, we find

$r = 5.36048$, $m_1 = 0.169185$, $m_2 = 3.191295$,

$b = 9.33236$, $a_1 = 0.469842$, $a_2 = 8.86252$.

Also $y_0 = 58.1282$, and the equation of the curve is therefore

$$y = 58.1\Big(1 + \frac{x}{0.470}\Big)^{0.169}\Big(1 - \frac{x}{8.86}\Big)^{3.19}.$$

The position of the origin, which is at the mode, is given by

$$(\text{mean} - \text{mode}) = \mu'_1 - a_1 = \frac{bm'_1}{r} - \frac{bm_1}{m_1+m_2} = \frac{b(m_2 - m_1)}{r(r-2)} = \tfrac12\,\frac{\mu_3}{\mu_2}\cdot\frac{r+2}{r-2}; \qquad (9)$$

thus,

$$\text{mode} = 4.3405556 - \tfrac12\,\frac{\nu_3}{\nu_2}\cdot\frac{r+2}{r-2},$$

or, allowing for units in applying formula (9),

= 2.3052009.

[When $\mu_3$ is positive $m_2$ goes with the positive root of the
quadratic, and vice versa.]

This enables us to write down any x, and thence y by substituting
for x in the equation of the curve, which, by taking logs, may be
written


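$$\log y = \log y_0 + m_1\log\Big(1 + \frac{x}{a_1}\Big) + m_2\log\Big(1 - \frac{x}{a_2}\Big),$$

e.g. for the x of the group (2.6-3.9), bearing in mind that 1.3 is the
unit of measurement for x, we have

$$x_{3.25} = (3.25 - 2.3052009)/1.3 = 0.9447991/1.3.$$

Hence

$$\Big(1 + \frac{x_{3.25}}{a_1}\Big) = 2.546835; \qquad \Big(1 - \frac{x_{3.25}}{a_2}\Big) = 0.9179953;$$

$$m_1\log\Big(1 + \frac{x_{3.25}}{a_1}\Big) = 0.0686892; \qquad m_2\log\Big(1 - \frac{x_{3.25}}{a_2}\Big) = -0.118587;$$

so that $\log y = 1.714489$,
and $y_{3.25} = 51.82$.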
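
The ordinate just found, and the value of $y_0$ on which it rests, can be checked as follows. This is a minimal sketch with hypothetical function names; it assumes the printed constants of the 'unemployed' example and uses `math.gamma` from the Python standard library in place of a table of the Gamma function.

```python
import math

N, B = 180, 9.33236
M1, M2 = 0.169185, 3.191295
A1, A2 = 0.469842, 8.86252
MODE, UNIT = 2.3052009, 1.3   # mode and class interval, per cent.


def y0():
    """Central constant of the Type I curve from N, b, m1 and m2."""
    shape = (M1 ** M1) * (M2 ** M2) / (M1 + M2) ** (M1 + M2)
    gammas = math.gamma(M1 + M2 + 2) / (math.gamma(M1 + 1) * math.gamma(M2 + 1))
    return N / B * shape * gammas


def ordinate(percent):
    """Ordinate at a given percentage unemployed, found by logs."""
    x = (percent - MODE) / UNIT          # deviation from the mode
    log_y = (math.log10(y0())
             + M1 * math.log10(1 + x / A1)
             + M2 * math.log10(1 - x / A2))
    return 10 ** log_y


print(round(y0(), 2), round(ordinate(3.25), 2))
```

The two figures should be close to 58.13 and 51.82 respectively.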


Similarly the ordinates at the centre points of the other groups
may be calculated, but it must be remembered that the resulting
values are only a first approximation to the observed frequencies,
and a better series is obtained if, by using some good quadrature
formula, we calculate the areas for the successive groups between
the curve, the bounding ordinates, and the axis of x. Indeed in
the case of the group (1.3–2.6) it is essential to do this, because
(1) the rise of the curve is so very abrupt as to render the
determination of the single ordinate at the centre quite inadequate for
an accurate measure of the frequency in that group, and (2) a
portion of the group falls outside the range of the curve, which only
starts at 1.6944063 (i.e. mode $- 1.3a_1$), and this has to be allowed
for in finding the frequency as represented by the area between the
curve and axis.

The base of the required area, range (1.6944063 to 2.6), was
therefore divided into eight equal parts and the ordinates at the
points of division were determined. The area was then found by
using Simpson's well-known formula:

$$\text{Area} = \tfrac{1}{3}k\left[(y_0+y_{2p}) + 2(y_2+y_4+\dots+y_{2p-2}) + 4(y_1+y_3+\dots+y_{2p-1})\right]$$

where k denotes the length of one of the equal parts into which
the base is divided and 2p is their number; in our case $p=4$ and
$k=\tfrac18$, the class interval being the unit, and the result is to be
reduced in the ratio

0.9055937 : 1.3

in order to allow for the smaller range of this group; we thus get
as the area for the group

$$\frac{0.9055937}{1.3}\times\frac{1}{24}\left[(y_0+y_8)+2(y_2+y_4+y_6)+4(y_1+y_3+y_5+y_7)\right] = 37.39.$$
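
The quadrature step just described can be sketched in a few lines of Python. This is only an illustration of Simpson's rule with an even number of equal parts; the function name `simpson` and the test curve are assumptions made for the example, and the ordinates of the fitted Type I curve are not reproduced here.

```python
from math import exp

def simpson(f, lo, hi, parts):
    """Composite Simpson's rule over [lo, hi], split into an even number of equal parts."""
    if parts % 2:
        raise ValueError("parts must be even (2p in the text)")
    k = (hi - lo) / parts                     # length of one equal part
    ys = [f(lo + i * k) for i in range(parts + 1)]
    ends = ys[0] + ys[-1]                     # y0 + y(2p)
    even = sum(ys[2:-1:2])                    # y2 + y4 + ... + y(2p-2)
    odd = sum(ys[1::2])                       # y1 + y3 + ... + y(2p-1)
    return k / 3 * (ends + 2 * even + 4 * odd)

# Illustration on a made-up smooth curve (NOT the book's Type I curve),
# using eight equal parts as in the text (p = 4).
area = simpson(lambda x: exp(-x), 0.0, 1.0, 8)
cubic = simpson(lambda x: x ** 3, 0.0, 2.0, 2)   # Simpson's rule is exact for cubics
```

For the truncated group of the text one would integrate from the start of the curve to 2.6 in class-interval units and then apply the stated reduction for the shorter base.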

The observed and calculated frequencies for the whole series are
compared in Table (44), the remaining areas in col. (4) being
calculated by the simpler but somewhat less accurate form of Simpson's
formula, when only three ordinates are used, namely,

$$\int_{-\frac12}^{+\frac12} y\,dx = \tfrac16\left(y_{-\frac12} + 4y_0 + y_{+\frac12}\right).$$

Table (44). Comparison of Observed and Theoretical
Frequencies of Unemployed Percentages

   (1)          (2)          (3)         (4)        (5)         (6)           (7)
Percentage    Observed     Theoretical Frequency.  Deviation.  Square of   Ratio of No.
Unemployed.   Frequency.   Ordinates.    Areas.                Deviation.  in Col. (6) to
                                                                           No. in Col. (4).
   1.3–          33          65.3*       37.4       +4.4        19.36         0.52
   2.6–          57          51.8        51.6       -5.4        29.16         0.57
   3.9–          41          37.8        37.8       -3.2        10.24         0.27
   5.2–          24          24.9        25.0       +1.0         1.00         0.04
   6.5–          10          14.8        14.9       +4.9        24.01         1.61
   7.8–          11           7.7         7.8       -3.2        10.24         1.31
   9.1–           3           3.3         3.4       +0.4         0.16         0.05
  10.4–           1           1.0         1.2       +0.2         0.04         0.03
   ..           180           ..        179.1        ..          ..       χ² = 4.40


To test the goodness of fit we have $n'=8$, $\chi^2=4.40$, whence, by
means of the P table, P=0.731852. Thus, roughly, we may say that
three out of every four random samples of 180 records would give a
worse fit with the proposed curve than is given by the actual
distribution observed, so that the fit may be regarded as quite a
reasonably good one. This conclusion is also supported by an examination
of the curve which has been drawn, fig. (36), with the histogram of
the given statistics.
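
The test can be retraced numerically. The sketch below recomputes $\chi^2$ from the observed frequencies and theoretical areas of Table (44) and evaluates P from the closed-form series for the chi-square tail. It assumes that $n'=8$ groups corresponds to seven degrees of freedom, and that the second group reads 57 observed against an area of 51.6 (the reading that reproduces the totals 180 and 179.1); the function name `chi_square_tail` is hypothetical. Under those assumptions the printed P agrees with linear interpolation in a table entered at whole-number values of $\chi^2$.

```python
from math import erfc, exp, pi, sqrt

def chi_square_tail(chi2, dof):
    """Upper-tail probability of chi-square for `dof` degrees of freedom (closed-form series)."""
    half = chi2 / 2.0
    if dof % 2 == 0:
        term, total = 1.0, 1.0
        for k in range(1, dof // 2):
            term *= half / k
            total += term
        return exp(-half) * total
    chi = sqrt(chi2)
    term, total = chi, 0.0
    for k in range(1, (dof - 1) // 2 + 1):
        total += term
        term *= chi2 / (2 * k + 1)
    return erfc(chi / sqrt(2.0)) + sqrt(2.0 / pi) * exp(-half) * total

# Observed frequencies and theoretical areas as read from Table (44);
# the second pair (57, 51.6) is an assumed reading of a damaged entry.
observed = [33, 57, 41, 24, 10, 11, 3, 1]
areas = [37.4, 51.6, 37.8, 25.0, 14.9, 7.8, 3.4, 1.2]

chi2 = sum((t - f) ** 2 / t for f, t in zip(observed, areas))
dof = len(observed) - 1      # assumption: n' = 8 groups -> 7 degrees of freedom

p_exact = chi_square_tail(chi2, dof)
# Linear interpolation at 4.40 between table arguments 4 and 5
p_table = chi_square_tail(4, dof) + 0.40 * (chi_square_tail(5, dof) - chi_square_tail(4, dof))
```

The exact tail value and the interpolated one differ only in the third decimal place, so the conclusion about the fit is unaffected.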

Example (3). The data for this example concerning infectious
diseases will be found in Table (16), p. 62 (or, see p. 224); the
reader should work out the moments for himself and verify the
following results:

[* The ordinate in this case cannot be accepted as an approximation to the
frequency given by the curve.]

The first four moments referred to 7 as origin are

0.282158, 4.86307, 17.4855, 129.394.

Referred to the mean, 7.564316, the three latter become

$$\nu_2 = 4.78346, \quad \nu_3 = 13.4140, \quad \nu_4 = 111.964.$$

If we do not assume high contact at the terminals, and certainly at
the lower end it is doubtful, we deduce from the above values of
the moments that

$$\beta_1 = 1.64396, \quad \beta_2 = 4.89321, \quad \kappa = -1.53.$$

Thus the fitting curve is of Type I. and its constants, when
calculated, are

$$r = 11.7819, \quad m_1 = 0.31171, \quad m_2 = 9.47020,$$

$$a_1 = 0.79216, \quad a_2 = 24.0671, \quad y_0 = 60.363.$$

Fig. (36). Horizontal axis: Percentage Unemployed.

The equation of the curve is therefore, retaining three significant
figures throughout,

$$y = 60.4\left(1+\frac{x}{0.792}\right)^{0.312}\left(1-\frac{x}{24.1}\right)^{9.47}.$$

The curve starts at 2.02904 (so that the first group of observations
lies wholly outside its range) and ends at 51.7475. It is drawn,
together with the corresponding histogram, in fig. (37).

Suppose, just for the sake of comparison, we assume high
contact at the terminals and attempt to fit the given distribution
with a Type III. curve, to which Type I. is closely related.

We then have, after making Sheppard's adjustments,

$$\mu_2 = 4.70013, \quad \mu_3 = 13.4140, \quad \mu_4 = 109.601,$$

whence $\beta_1 = 1.73295$, $\beta_2 = 4.96129$, $\kappa = -1.47$.
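
The reader who wishes to verify these figures may find the following sketch useful. It transfers the moments about the working origin to the mean, applies Sheppard's adjustments for a unit class interval, and forms $\beta_1$, $\beta_2$ and $\kappa$. It assumes that $\kappa$ is Pearson's criterion $\beta_1(\beta_2+3)^2 / 4(4\beta_2-3\beta_1)(2\beta_2-3\beta_1-6)$, which reproduces the values quoted in the examples of this chapter; the function names are illustrative only.

```python
def central_from_raw(n1, n2, n3, n4):
    """Second, third and fourth moments about the mean from moments about a working origin."""
    mu2 = n2 - n1 ** 2
    mu3 = n3 - 3 * n1 * n2 + 2 * n1 ** 3
    mu4 = n4 - 4 * n1 * n3 + 6 * n1 ** 2 * n2 - 3 * n1 ** 4
    return mu2, mu3, mu4

def sheppard(mu2, mu3, mu4):
    """Sheppard's adjustments, the class interval being the unit."""
    return mu2 - 1 / 12, mu3, mu4 - mu2 / 2 + 7 / 240

def criterion(mu2, mu3, mu4):
    """beta1, beta2 and kappa (assumed here to be Pearson's criterion)."""
    b1 = mu3 ** 2 / mu2 ** 3
    b2 = mu4 / mu2 ** 2
    kappa = b1 * (b2 + 3) ** 2 / (4 * (4 * b2 - 3 * b1) * (2 * b2 - 3 * b1 - 6))
    return b1, b2, kappa

# Example (3): moments referred to 7 as origin, unit = class interval
raw = (0.282158, 4.86307, 17.4855, 129.394)
unadjusted = central_from_raw(*raw)
adjusted = sheppard(*unadjusted)

b1_I, b2_I, kappa_I = criterion(*unadjusted)      # used for the Type I fit
b1_III, b2_III, kappa_III = criterion(*adjusted)  # used for the Type III fit
```

Both values of $\kappa$ come out negative, which is the indication for Type I referred to in the text.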


It will be noted that the theoretically correct type to take here
is Type I., but this was discarded because, when attempted,
it led to a curve starting at a point corresponding to a disease rate
of 3.385, so that the central ordinates of each of the first two
observed groups lay outside the curve altogether.

Type III. curve is of the form

$$y = y_0 e^{-\gamma x}\left(1+\frac{x}{a}\right)^{\gamma a}.$$

Fig. (37). Horizontal axis: Disease Rate per 1000 persons living.
Vertical axis: No. of Towns with the given Rate of Disease.
Curves shown: Type I. and Type III.

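To express the constants in terms of the moments, noting that the
curve starts from $x=-a$ on one side and goes off to infinity on the
other, we have

$$N = \int_{-a}^{\infty} y\,dx = y_0\int_{-a}^{\infty} e^{-\gamma x}\left(1+\frac{x}{a}\right)^{p}dx \qquad (\text{where } \gamma a = p)$$

$$= \frac{y_0 e^{\gamma a}}{a^{p}\gamma^{p}}\int_{-a}^{\infty} e^{-(\gamma a+\gamma x)}(\gamma a+\gamma x)^{p}\,dx$$

$$= \frac{y_0 e^{p}}{a^{p}\gamma^{p+1}}\int_{0}^{\infty} e^{-z}z^{p}\,dz \qquad (\text{where } \gamma a+\gamma x = z)$$

$$= \frac{y_0 e^{p}}{a^{p}\gamma^{p+1}}\,\Gamma(p+1).$$

Therefore, since $\gamma = p/a$,

$$y_0 = \frac{N p^{p+1}}{a e^{p}\,\Gamma(p+1)} \qquad (10)$$

Again, the nth moment of the distribution referred to $(-a, 0)$
as origin is

$$N\mu'_n = \int_{-a}^{\infty} y(a+x)^{n}dx = \frac{y_0}{a^{p}}\int_{-a}^{\infty} e^{-\gamma x}(a+x)^{p+n}dx$$

$$= \frac{y_0 e^{\gamma a}}{a^{p}\gamma^{p+n}}\int_{-a}^{\infty} e^{-(\gamma a+\gamma x)}(\gamma a+\gamma x)^{p+n}dx = \frac{y_0 e^{p}}{a^{p}\gamma^{p+n+1}}\,\Gamma(p+n+1).$$

Therefore, by (10),

$$\mu'_n = \frac{p^{p+1}}{a e^{p}\,\Gamma(p+1)}\cdot\frac{e^{p}\,\Gamma(p+n+1)}{a^{p}\gamma^{p+n+1}} = \frac{\Gamma(p+n+1)}{\gamma^{n}\,\Gamma(p+1)}.$$

Hence,

$$\mu'_1 = \Gamma(p+2)/\gamma\,\Gamma(p+1) = (p+1)/\gamma$$

$$\mu'_2 = \Gamma(p+3)/\gamma^{2}\,\Gamma(p+1) = (p+2)(p+1)/\gamma^{2}$$

$$\mu'_3 = \Gamma(p+4)/\gamma^{3}\,\Gamma(p+1) = (p+3)(p+2)(p+1)/\gamma^{3}.$$

Transferring to the mean as origin we have for the moments, since

$$\bar{x} = \mu'_1 = (p+1)/\gamma,$$

$$\mu_2 = \mu'_2 - \bar{x}^{2} = (p+1)/\gamma^{2}$$

$$\mu_3 = \mu'_3 - 3\bar{x}\mu_2 - \bar{x}^{3} = 2(p+1)/\gamma^{3}.$$

Hence, combining these last two equations,

$$\gamma = 2\mu_2/\mu_3, \qquad p = (4\mu_2^{3}/\mu_3^{2}) - 1 \qquad (11)$$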

In our particular case these equations give

$$\gamma = 0.700780, \quad p = 1.30820, \quad a = 1.86678,$$

and, therefore, by (10),

$$y_0 = 55.3323.$$

Hence the curve is

$$y = 55.3\,e^{-0.701x}\left(1+\frac{x}{1.87}\right)^{1.31}.$$

The equation of the curve, on taking logs, gives

$$\log y = \log y_0 - \gamma\log e\cdot x + p\log\left(1+\frac{x}{a}\right) = 1.742979 - 0.304345x + 1.30820\log\left(1+\frac{x}{1.86678}\right).$$

Before we can go on to calculate the ordinates of the curve we
need to know where the origin lies, and since it coincides with the
mode it may be found from

$$\text{mean} - \text{mode} = \mu'_1 - a = \frac{p+1}{\gamma} - \frac{p}{\gamma} = \frac{1}{\gamma} = \frac{\mu_3}{2\mu_2} \qquad (12)$$

Thus,

mode = 7.564316 - 2.853960 = 4.71036.

Suppose now we wish to calculate the ordinate corresponding to
the x of the centre point of group (6–8), we have

$$x_7 = \tfrac12(7 - 4.71036) = 1.14482,$$

bearing in mind that the unit is a rate of 2 per 1000.

Hence, substituting this value in the equation for log y,

$$\log y_7 = 1.666278, \qquad y_7 = 46.374,$$

and similarly any other y may be found.

The curve starts at

mode - a = 4.71036 - 2(1.86678) = 0.97680,

so that the range of the first group as determined from the curve is
(0.9768–2), and not (0–2) as in the observations.
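
The arithmetic of the Type III. fit can be checked with a short sketch built directly on equations (10), (11) and (12). It assumes the Sheppard-adjusted moments quoted for this example, a total of 241 towns, and a class interval of 2 per 1000; the names `type3_constants` and `ordinate` are illustrative, not the author's.

```python
from math import exp, gamma

def type3_constants(mu2, mu3, total):
    """Constants of y = y0 * exp(-g*x) * (1 + x/a)**p from the moments, by (10) and (11)."""
    g = 2 * mu2 / mu3                     # gamma of the text
    p = 4 * mu2 ** 3 / mu3 ** 2 - 1
    a = p / g                             # since gamma * a = p
    y0 = total * p ** (p + 1) / (a * exp(p) * gamma(p + 1))
    return g, p, a, y0

g, p, a, y0 = type3_constants(4.70013, 13.4140, 241)

mean, unit = 7.564316, 2.0                # unit: a rate of 2 per 1000
mode = mean - unit / g                    # (12): mean - mode = 1/gamma class units
start = mode - unit * a                   # lower end of the curve

def ordinate(rate):
    """Ordinate of the fitted curve at a given disease rate per 1000."""
    x = (rate - mode) / unit
    return y0 * exp(-g * x) * (1 + x / a) ** p

y7 = ordinate(7.0)                        # centre of the group (6-8)
```

Replacing the single ordinate by an area over each group, as the text recommends, would then give the theoretical frequencies.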

The ordinates and afterwards the areas, calculated by a method
somewhat similar to that indicated in Example (2), were determined
for each separate group of observations, and the results for both
Type I. and Type III. curves are compared in Table (45).

Type III. curve is drawn on the same diagram, fig. (37), as Type I.
curve and the observation histogram, and the result lends emphasis
to an important point, namely, the necessity for replacing ordinates
by areas to obtain the frequency proper to any group.

In order to get a measure of the goodness of fit in each case,
the function P was calculated, but in the Type I. comparison the
first group had to be omitted to avoid the infinite term which would
have resulted in $\chi^2$ owing to this group falling right outside the
curve, that is to say, the test had to be confined to towns in which
the observed case rate was not less than 2. The values found for
P were:

Type I.: P=0.34307,

Type III.: P=0.46298,

so that in every 100 samples containing 241 observations each, we
should get, roughly, 34 deviating from the Type I. curve and 46
deviating from the Type III. curve, at least as widely as the given
distribution. In neither case can the fit be regarded as a very
good one, but the failure is only marked in one or two groups, such
as that of maximum frequency, where there may be other than
random causes to account for it; e.g. where isolation is inefficient
the disease is likely to spread, one case infects another: in other
words, the events are not independent.


Table (45). Comparison of Observed Distribution of Infectious
Disease Rates, notified in 241 large Towns of
England and Wales, with Theoretical Distribution.

   (1)          (2)          (3)         (4)            (5)              (6)
Case Rate.    Observed     Theoretical Frequency.   (f1 - f)²/f1     (f3 - f)²/f3
              Frequency.   Type I.     Type III.
                (f)         (f1)        (f3)
    0–            5          ..          6.6            ..              0.39
    2–           39         52.6        43.7           3.52             0.51
    4–           69         55.4        54.3           3.34             3.98
    6–           41         43.2        46.2           0.11             0.59
    8–           29         31.2        33.6           0.15             0.63
   10–           22         21.5        22.4           0.01             0.01
   12–           16         14.2        14.1           0.23             0.26
   14–            7          9.1         8.6           0.48             0.30
   16–            5          5.6         5.1           0.06             0.00
   18–            3          3.3         2.9           0.03             0.00
   20–            4          1.9         1.7           2.32             3.11
   22–            0          1.0         0.9           1.00             0.90
   24–            0          0.6         0.5           0.60             0.50
   26–            1          0.3         0.3           1.63             1.63
  Total         241        239.8       240.9       χ² = 13.38          12.81

Example (4) refers to the wages of certain women tailors
previously recorded in Table (11), p. 41. The data as given in the
original suffered a disadvantage common to such statistics: at
either end the grouping differed from that in the centre, two or three
classes being lumped together owing to the smallness of frequency
in each. The figures ran thus: Under 5s., 19; 5s. and under 6s.,
180; 6s. and under 7s., 384; . . . ; 23s. and under 24s., 64;
24s. and under 25s., 54; 25s. and under 30s., 122; 30s. and over,
36. They were recast in the form shown in Table (46), suggested
by an examination of the histogram, in order to make the fitting
simpler.

The first four moments calculated from this adapted table and
referred to 12s. as origin are:

$$\nu'_1 = 0.556718, \quad \nu'_2 = 5.056373, \quad \nu'_3 = 16.70163, \quad \nu'_4 = 123.7691.$$

When referred to the mean, 13.113436, the last three become

$$\nu_2 = 4.746438, \quad \nu_3 = 8.60179, \quad \nu_4 = 95.6914;$$

or, after making Sheppard's adjustments,

$$\mu_2 = 4.663105, \quad \mu_3 = 8.60179, \quad \mu_4 = 93.34741;$$

therefore, $\beta_1 = 0.729713$, $\beta_2 = 4.29291$, $\kappa = 1.63$.

The curve is thus of Type VI.,

$$y = y_0 (x-a)^{q_2}/x^{q_1}.$$

To calculate the constants, the nth moment about the origin is
given by

$$N\mu'_n = \int_a^{\infty} y x^{n}dx = y_0\int_a^{\infty}(x-a)^{q_2}x^{n-q_1}dx$$

$$= y_0\,a^{q_2-q_1+n+1}\int_0^1 z^{q_1-q_2-n-2}(1-z)^{q_2}dz \qquad (\text{where } z = a/x)$$

$$= y_0\,a^{q_2-q_1+n+1}\,\Gamma(q_1-q_2-n-1)\,\Gamma(q_2+1)/\Gamma(q_1-n).$$

Thus, putting n=0,

$$N = y_0\,a^{q_2-q_1+1}\,\Gamma(q_1-q_2-1)\,\Gamma(q_2+1)/\Gamma(q_1) \qquad (13)$$

and

$$\mu'_n = a^{n}\,\Gamma(q_1-q_2-1-n)\,\Gamma(q_1)/\Gamma(q_1-n)\,\Gamma(q_1-q_2-1);$$

therefore,

$$\mu'_1 = a\,\Gamma(q_1-q_2-2)\,\Gamma(q_1)/\Gamma(q_1-1)\,\Gamma(q_1-q_2-1) = a(q_1-1)/(q_1-q_2-2).$$

Also

$$\mu'_n/\mu'_{n-1} = a\,\Gamma(q_1-q_2-1-n)\,\Gamma(q_1-n+1)/\Gamma(q_1-n)\,\Gamma(q_1-q_2-n) = a(q_1-n)/(q_1-q_2-n-1).$$

Hence

$$\mu'_2 = a^{2}(q_1-1)(q_1-2)/(q_1-q_2-2)(q_1-q_2-3)$$

$$\mu'_3 = a^{3}(q_1-1)(q_1-2)(q_1-3)/(q_1-q_2-2)(q_1-q_2-3)(q_1-q_2-4)$$

$$\mu'_4 = a^{4}(q_1-1)(q_1-2)(q_1-3)(q_1-4)/(q_1-q_2-2)(q_1-q_2-3)(q_1-q_2-4)(q_1-q_2-5).$$

But these relations are precisely the same as those of Type I. with a
in place of b, $-q_1$ in place of $m_1$, and $q_2$ in place of $m_2$, so that
$(1+q_2)$, $(1-q_1)$* are the roots of

$$q^{2} - rq + 4r^{2}(r+1)/[\beta_1(r+2)^{2}+16(r+1)] = 0 \qquad (14)$$

where

$$r = 6(\beta_2-\beta_1-1)/(6+3\beta_1-2\beta_2) \qquad (15)$$

Also

$$y_0 = N a^{q_1-q_2-1}\,\Gamma(q_1)/\Gamma(q_1-q_2-1)\,\Gamma(q_2+1), \text{ by (13)} \qquad (16)$$

and a is given by

$$\mu_2 = a^{2}(1-q_1)(1+q_2)/r^{2}(r+1), \qquad (17)$$

$\mu_2$ being the second moment of the given distribution referred to
its mean as origin.

The distance of the mean from the origin is

$$\mu'_1 = a(q_1-1)/(q_1-q_2-2),$$

and this fixes the origin, for the mean is known directly from the
statistics.

To get the mode, use the equation of the curve, putting $dy/dx = 0$,
and we have

$$\text{origin} = \text{mode} - a q_1/(q_1-q_2).$$

Combining this with

$$\text{origin} = \text{mean} - a(q_1-1)/(q_1-q_2-2)$$

we have

$$\text{mean} - \text{mode} = a(q_1+q_2)/(q_1-q_2)(q_1-q_2-2) \qquad (18)$$



Applying these formulae to the case of the women tailors,

$$r = -38.7698, \quad q_1 = 51.5269, \quad q_2 = 10.7571, \quad a = 21.11018,$$

and the equation of the curve is

$$y = y_0 (x-21.1)^{10.8}/x^{51.5},$$

where $\log y_0 = 68.8254$.

Also the origin is at -41.9104, the mode at 11.4498, and the
maximum theoretical frequency is 2299.

[* When $\mu_3$ is positive $(1+q_2)$ goes with the positive root of the quadratic, and
vice versa.]

Table (46). Distribution of Wages of certain
Women Tailors, Actual and Theoretical.

              Frequency.                         Frequency.
Wages.    Actual.  Theoretical.    Wages.    Actual.  Theoretical.
  1s.–        5          1          19s.–       523       503
  3s.–       14         52          21s.–       262       278
  5s.–      564        452          23s.–       118       147
  7s.–     1243       1332          25s.–        64        75
  9s.–     2045       2096          27s.–        43        38
 11s.–     2339       2255          29s.–        27        19
 13s.–     1815       1898          31s.–        15         9
 15s.–     1432       1353          33s.–         9         5
 17s.–      854        859           ..          ..        ..
                                    Total     11,372    11,372

The theoretical and actual frequencies are compared in Table (46)
and the curve is drawn with the histogram in fig. (38).

Fig. (38).
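
The constants quoted for the women tailors can be retraced from equations (14) to (17). The following is a sketch only: it assumes $\mu_3$ positive (so that $1+q_2$ takes the positive root), uses the Sheppard-adjusted $\mu_2$, a total of 11,372 workers and a class interval of 2s.; the function name `type6_constants` is hypothetical. Logarithms of gamma functions are used because $y_0$ itself is far too large to handle comfortably.

```python
from math import lgamma, log, log10, sqrt

def type6_constants(beta1, beta2, mu2, total):
    """r, q1, q2, a and log10(y0) of y = y0*(x-a)**q2 / x**q1, by (14)-(17)."""
    r = 6 * (beta2 - beta1 - 1) / (6 + 3 * beta1 - 2 * beta2)            # (15)
    eps = 4 * r ** 2 * (r + 1) / (beta1 * (r + 2) ** 2 + 16 * (r + 1))   # constant term of (14)
    disc = sqrt(r ** 2 - 4 * eps)
    positive_root, negative_root = (r + disc) / 2, (r - disc) / 2
    q2 = positive_root - 1            # (1 + q2) is the positive root when mu3 > 0
    q1 = 1 - negative_root            # (1 - q1) is the other root
    a = sqrt(mu2 * r ** 2 * (r + 1) / ((1 - q1) * (1 + q2)))             # (17)
    log_y0 = (log10(total) + (q1 - q2 - 1) * log10(a)
              + (lgamma(q1) - lgamma(q1 - q2 - 1) - lgamma(q2 + 1)) / log(10))  # (16)
    return r, q1, q2, a, log_y0

r, q1, q2, a, log_y0 = type6_constants(0.729713, 4.29291, 4.663105, 11372)

mean, unit = 13.113436, 2.0           # mean in shillings; class interval of 2s.
origin = mean - unit * a * (q1 - 1) / (q1 - q2 - 2)
mode = origin + unit * a * q1 / (q1 - q2)
```

Note that $q_1 - q_2 - 2 = -r$, which gives a quick check on the two roots.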


Example (5) discusses the distribution of frequencies of specimens
of Anemone nemorosa with different numbers of sepals, recorded by
G. U. Yule (Biometrika, vol. i., p. 307).

The first four moments referred to 6 as origin are

$$\nu'_1 = 0.508, \quad \nu'_2 = 1.012, \quad \nu'_3 = 2.476, \quad \nu'_4 = 9.124.$$

Referred to the mean, 6.508, the last three become

$$\nu_2 = 0.7539360, \quad \nu_3 = 1.195905, \quad \nu_4 = 5.459941.$$

The contact, at one extremity certainly, being doubtful, Sheppard's
adjustments were not made in this case. Hence,

$$\beta_1 = 3.337259, \quad \beta_2 = 9.605476, \quad \kappa = 1.46.$$

Since $\kappa$ does not differ greatly from unity an attempt was made to
fit the observations with a Type V. curve, namely,

$$y = y_0 x^{-p} e^{-\gamma/x}.$$

The nth moment about the origin is given by

$$N\mu'_n = \int_0^{\infty} y x^{n}dx$$

(since, p and $\gamma$ being positive, y vanishes at x=0 and at x=∞)

$$= y_0\gamma^{n-p+1}\int_0^{\infty} z^{p-n-2}e^{-z}dz \qquad (\text{where } z = \gamma/x)$$

$$= y_0\gamma^{n-p+1}\,\Gamma(p-n-1).$$

Thus $N = y_0\gamma^{-p+1}\,\Gamma(p-1)$.

And $\mu'_n/\mu'_{n-1} = \gamma/(p-n-1)$.

Hence

$$\mu'_1 = \gamma/(p-2)$$

$$\mu'_2 = \gamma^{2}/(p-2)(p-3)$$

$$\mu'_3 = \gamma^{3}/(p-2)(p-3)(p-4).$$

Referred to the mean as origin, the last two moments become

$$\mu_2 = \gamma^{2}/(p-2)^{2}(p-3),$$

$$\mu_3 = 4\gamma^{3}/(p-2)^{3}(p-3)(p-4),$$

whence

$$\beta_1 = \mu_3^{2}/\mu_2^{3} = 16(p-3)/(p-4)^{2} = [16(p-4)+16]/(p-4)^{2};$$

this gives a quadratic for $(p-4)$, one solution of which is

$$p-4 = [8+4\sqrt{(4+\beta_1)}]/\beta_1, \qquad (19)$$

the positive root being taken in order to get a real $\gamma$.

Thus

$$\gamma^{2} = (p-2)^{2}(p-3)\mu_2 \qquad (20)$$

and

$$y_0 = N\gamma^{p-1}/\Gamma(p-1) \qquad (21)$$

The position of the origin is given by

$$\text{Origin} = \text{Mean} - \gamma/(p-2) \qquad (22)$$

Also the distance of the mode from the origin is $\gamma/p$, so that all
the constants of the curve are readily determined.

[* The sign of $\gamma$ is taken to be the same as that of $\mu_3$.]

In our particular case, we get

$$p = 9.643840, \quad \gamma = 17.10768,$$

and the curve is

$$y = y_0 x^{-9.64} e^{-17.1/x},$$

where $\log y_0 = 9.38179$. The origin is at 4.27 and the mode at 6.04.
The greatest frequency is 620 approximately, and the frequency
distribution, calculating areas for the several groups as if they ranged
between (4.5–5.5), (5.5–6.5), etc., is shown alongside the observed
distribution in Table (47). The curve is plotted in fig. (39) from the
ordinates which were calculated at the centre and extremities of
each group so as to enable Simpson's simple quadrature formula
to be used to get the areas.
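
Equations (19) to (22) translate directly into a few lines of Python. The sketch below assumes the unadjusted moments of this example, a total of 1000 specimens, and $\mu_3$ positive (so that $\gamma$ is taken positive); the function name `type5_constants` is illustrative only.

```python
from math import lgamma, log, log10, sqrt

def type5_constants(mu2, mu3, total, mean):
    """beta1, p, gamma, log10(y0), origin and mode of y = y0 * x**(-p) * exp(-gamma/x)."""
    beta1 = mu3 ** 2 / mu2 ** 3
    p = 4 + (8 + 4 * sqrt(4 + beta1)) / beta1           # (19), positive root
    g = sqrt((p - 2) ** 2 * (p - 3) * mu2)              # (20); positive because mu3 > 0 here
    log_y0 = log10(total) + (p - 1) * log10(g) - lgamma(p - 1) / log(10)   # (21)
    origin = mean - g / (p - 2)                         # (22)
    mode = origin + g / p
    return beta1, p, g, log_y0, origin, mode

beta1, p, g, log_y0, origin, mode = type5_constants(0.7539360, 1.195905, 1000, 6.508)

# Ordinate at the mode, i.e. the greatest frequency of the fitted curve
x_mode = g / p
peak = 10 ** (log_y0 - p * log10(x_mode) - (g / x_mode) * log10(2.718281828459045))
```

Here the class interval is one sepal, so no change of unit is needed in locating the origin and mode.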


Table (47). Distribution of Sepals of Anemone
Nemorosa, observed and calculated.

No. of         Frequency.            No. of         Frequency.
Sepals.   Observed.  Calculated.     Sepals.   Observed.  Calculated.
   5         34          51             9         14          22
   6        576         544            10          4           6
   7        276         296            11         ..           2
   8         92          81            12          4           1
                                      Total     1000        1003


[Examples have been given above of five out of the seven different types
of frequency curve that have been enumerated. For further examples of
all the types and a complete account of the method reference should be
made to Professor Pearson's memoirs, especially the following:

Roy. Soc. Phil. Trans., vol. 186a, pp. 343-414 (1895), On Skew Variation
in Homogeneous Material; and a Supplementary Memoir in vol. 197a, pp. 443-
459 (1901).

Biometrika, vol. i., pp. 265 et seq., On the Systematic Fitting of Curves to
Observations and Measurements, continued in vol. ii., pp. 1-23. Also vol. iv.,
pp. 169-212, which discusses various historical hypotheses made to generalize
the Gaussian Law, the basis of the symmetrical normal curve.

A large number of highly interesting practical illustrations of Pearsonian
curve fitting occur throughout the pages of Biometrika, while W. P. Elderton's
Frequency Curves and Correlation contains an admirably concise treatment of
the theory, with applications to meet more particularly the actuarial point
of view.

It should be stated that rival curves and methods have been proposed as
suitable for fitting certain types of frequency distribution, some of which have
scarcely received the attention and the trial they deserve. Among the most
interesting are those developed by Professor Edgeworth; for some account of
his voluminous work upon the subject the reader may refer to several memoirs
in the Journal of the Royal Statistical Society, beginning December 1898
(the Method of Translation), among which the following are important as
giving more recent results of his researches:

Vol. lxix. (1906), The Generalized Law of Error or Law of Great Numbers.

Vol. lxxvii. (1914), On the Use of Analytical Geometry to Represent Certain
Kinds of Statistics.

Vol. lxxix. (1916), On the Mathematical Representations of Statistical Data;
continued in vol. lxxx. (1917).

Two memoirs may be cited as of particular interest, those of May 1917
and March 1918, because they reply to criticism and draw a comparison from
their author's point of view between his curves and those of Professor Pearson.]






CHAPTER XVIII


THE NORMAL CURVE OF ERROR

Let us return for a moment to the general statement on p. 143,
that 'whenever we have n similar but independent events happening
in which the probability of success for each is p, the different
resulting possibilities as to success are given by the successive
terms in $(s+f)^n$, namely,

$$s^{n} + n s^{n-1}f + \frac{n(n-1)}{1\cdot 2}s^{n-2}f^{2} + \dots + f^{n},$$

and their correspondent probabilities by the successive terms in
$(p+q)^n$, namely,

$$p^{n} + n p^{n-1}q + \frac{n(n-1)}{1\cdot 2}p^{n-2}q^{2} + \dots + q^{n}.$$'

When we come to try and apply this theory directly to cases
other than those of random sampling in artificial experiments with
coins, dice, etc., we are faced at once with difficulties because of
the limiting character of the assumption on which the theory rests,
namely, that all the events are to be similar and independent. The
similarity demanded is of the same radical type as that existing
when we throw the same die or spin the same coin twice running,
and the test for it is that p, the chance of success, is to be the same
for every individual event. The independence is to be such that
no single event and no combination of events is to have any influence
upon any of the rest.

Now for most classes of events it is impossible to assign any
a priori value to p at all, still less can we be sure that p does not
change from one event to the next. For example, the chance of
death for soldiers in war-time varies from regiment to regiment
according to where they happen to be located; for the same
regiment it varies from battalion to battalion according to whether
they are in the trenches or behind the lines; and from individual
to individual according to innumerable little accidents of time, place,
and condition. Also, where the shells burst thickest, p increases
for any soldier there, but it increases also for his neighbour. Thus
the events in such a case are not similar, neither are they
independent.

Moreover, as it stands, the theory cannot be applied to any
distribution in which the character observed is capable of continuous
variation. This difficulty, however, has been overcome, as we
have seen, by replacing the histogram representative of the binomial
by a continuous curve which at the same time serves to describe
the discontinuous series to a high degree of accuracy.

To illustrate how close this description can be, even when n
is comparatively small, we will fit with its appropriate normal
curve the symmetrical binomial polygon formed by joining up
the summits of the ordinates representing successive terms of
the series

$$1024\left(\tfrac12+\tfrac12\right)^{10}$$

erected at unit distance apart.
The total area bounded by the polygon, the extreme ordinates,
and the axis of x is practically

$$= (y_0+y_1+y_2+\dots+y'_1+y'_2+\dots)\times(1)$$

= sum of the given ordinates
= 1024.

The equation of the normal curve is

$$y = Y_0 e^{-x^{2}/2\sigma^{2}},$$

where

$$\sigma^{2} = 11\times\tfrac12\times\tfrac12 = 2.75,$$

$$Y_0 = N/\sqrt{2\pi}\,\sigma = 1024/\sqrt{(5.5\pi)}.$$

Hence, taking logs, we have

$$\log y = \log Y_0 - \frac{x^{2}}{2\sigma^{2}}\log_{10}e = 2.3915437 - x^{2}(0.0789626).$$

It is easy from this equation to calculate the normal curve ordinates
corresponding to x=0, 1, 2, 3, 4, 5, and the results, compared with
the polygon ordinates, are as follows:

   x      Ordinate of Polygon.    Normal Curve Ordinate.
   0             252                     246.3
  ±1             210                     205.4
  ±2             120                     119.0
  ±3              45                      48.0
  ±4              10                      13.4
  ±5               1                       2.6
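
The comparison in the table above is easily regenerated. The sketch below takes the ordinates of the polygon as the binomial coefficients of the tenth power (since the total is $1024 = 2^{10}$) and uses $\sigma^2 = 2.75$ exactly as in the text; the variable names are illustrative.

```python
from math import comb, exp, pi, sqrt

n, total = 10, 1024
sigma2 = (n + 1) * 0.5 * 0.5              # 2.75, the value used in the text
y_max = total / sqrt(2 * pi * sigma2)     # Y0 = N / (sqrt(2*pi) * sigma)

rows = []
for x in range(0, 6):
    polygon = comb(n, n // 2 + x)         # term of 1024 * (1/2 + 1/2)**10 at distance x from the centre
    normal = y_max * exp(-x * x / (2 * sigma2))
    rows.append((x, polygon, round(normal, 1)))

for x, polygon, normal in rows:
    print(f"x = ±{x}: polygon {polygon:4d}, normal curve {normal:6.1f}")
```

The agreement is closest near the centre and loosest in the extreme tails, where the polygon ordinates are very small.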


Now although the circumstances in which the series

$$N(p+q)^{n}$$

may be taken to represent the frequency distribution resulting
from a particular kind of experiment were so stringently defined,
there is no reason why the normal curve itself to which the theory
led should be subjected to precisely the same limitations. After
all, the real and only justification for choosing one curve rather
than another to fit any given observations is that it does succeed
in fitting them better. But when the further question is asked
why the normal curve should succeed in describing some results
so well, we must not be tempted by analogy to rush to the
conclusion that the causes at work are necessarily independent, and
equal, and so on. In short, the theoretical justification and the
empirical use of the normal curve are two quite different matters.

Experience shows that the normal curve suffices to fit certain
types of distribution, besides those which arise in tossing coins and
in similar experiments, with remarkable accuracy; among these
may be noted:

1. Certain biological statistics; for instance, the proportions of
male to female births taken over a series of years for a large
community such as the population of a country; also the proportions
of different types of plants and animals resulting from
cross-fertilization.

2. Certain anthropometrical, particularly craniometrical and allied
statistics, such as the height, weight, lengths of various bones, skull
measurements, etc., of a large group of persons, and the agreement
is the closer if the group be reasonably homogeneous, i.e. composed
of individuals of the same nationality and sex between the same
narrow age limits, etc.; also measurements of a similar character
in animals and plants.

3. Errors of observation in experimental work; for example,
several measurements of the same quantity (length, weight, speed,
temperature, or whatever it be) will contain errors of this kind
which are equally liable to be above or below the true value.

4. The marks of shots upon a given target, assuming that the
shots are equally liable to err in any given direction. This is an
interesting case of the normal law in two dimensions, for the north
and south line and the east and west line through the centre of
the target may both be regarded as axes of normal curves of error.*

5. Certain sociological statistics of a comparatively stationary
character: for example, rates of birth, marriage, or death at
neighbouring times or like places; also the wages (and possibly the output
if it could be satisfactorily measured) of large numbers of workers
engaged in the same occupation under the same general conditions.

6. Any statistics or quantities that are individually compounded of
a large number of elements, mostly independent of one another, which
themselves vary between limits not very widely divergent, and none
of which exert a preponderating influence upon their resultant
statistic. The latter may be simply the sum of its elements, or,
more generally, it may be any function of the elements which, to
the first degree of approximation, can be expressed in linear form.

Now it would be a difficult matter in most of these cases to satisfy
ourselves as to the fulfilment or non-fulfilment of conditions like
those on which the binomial distribution rests. It is not easy
indeed to visualize them perfectly, except in artificial experiments
where they are largely under control. If anything, the chances
seem almost hopelessly against their fulfilment in ordinary life,
so closely must we hedge round our sample to keep out unequal
influences. For example, to use a frequently quoted illustration,
if p measures the chance of death for an individual, the death rate
varies, as we know, considerably from place to place according to
the age and sex constitution of the population; it is influenced by
differences in class, and occupation, and manner of life; it is
altered from time to time, violently by the ravages of war or disease,
more gradually by improvement in general sanitation, housing
conditions, etc. We should only expect to get the binomial
distribution (and consequently the normal law if it depended upon the
same postulates) exactly verified if we were dealing with the same
stationary population existing under the same stable conditions
over a long period of time; moreover, since p is to be identical for
each individual event in the ideal case, it would be further necessary
that every family and every individual in our population should
also remain in the same stationary and stable state. This is
manifestly impossible, especially after the industrial revolution which
the advent of machine power created.

[* Sir John Herschel published in the Edinburgh Review (1850) an a priori
proof of the normal law from a consideration of this problem. Taking $\phi(x^2)$ as
the expression of the law for one dimension and $\phi(x^2+y^2)$ for two,
the independence of errors in perpendicular directions leads to the functional
equation $\phi(x^2+y^2) = \phi(x^2)\times\phi(y^2)$, the solution of which is of the
form $\phi(x^2) = e^{cx^2}$, c being a constant. It should be added that the
assumptions underlying the proof are not entirely above criticism.]

These considerations suggest the interesting question whether the
various types of statistics we have enumerated, as being
approximately subject to the normal law, could not, if we knew more
about them, really all be included under heading number (6),
representing a further development from the binomial theory and an
enlargement of the field in which it holds good.

In an earlier chapter, when we were discussing the connection
between marriage rate and prices, we showed how it was possible
by a method of averaging to differentiate between long-time and
short-time effects. The more transient fluctuations, only
superficial in character, were removed and the real nature of any
permanent change in the figures was revealed. In much the same
way, when we have a group of statistics which do not perhaps fit
a normal curve of error at all closely, it may be possible by random
averaging to get rid of some of the fluctuations which cause the
badness of fit and to obtain a new group of statistics which more
nearly obey the normal law. Averaging, that is to say, tends
to smooth away the rough outstanding abnormalities; and we shall
presently show that if two variables, $X_1$, $X_2$, which are independent,
obey the normal law, any linear function of the variables,
$(w_1X_1+w_2X_2)$, obeys the same law. This may throw some light
on Class (6) where each statistic represents a compound, that is,
in a broad sense, a kind of an average of a large number of elements
which partially neutralize one another's influence, or rub the corners
off one another, so to speak, since no single element is, by hypothesis,
to exert an overwhelming influence upon the compound itself.
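
The smoothing effect of compounding can be watched in a small exact calculation. The sketch below is an illustration of the idea only, not a proof: it builds the distribution of the sum of k independent throws of a fair die by repeated convolution and follows $\beta_2$, which equals 3 for the normal curve. The die, the helper names `convolve` and `beta2`, and the choice of ten throws are all assumptions made for the example.

```python
def convolve(a, b):
    """Distribution of the sum of two independent variables with probabilities a and b."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, pa in enumerate(a):
        for j, pb in enumerate(b):
            out[i + j] += pa * pb
    return out

def beta2(probs):
    """beta2 = mu4 / mu2**2 for a distribution on 0, 1, 2, ..."""
    mean = sum(i * p for i, p in enumerate(probs))
    m2 = sum((i - mean) ** 2 * p for i, p in enumerate(probs))
    m4 = sum((i - mean) ** 4 * p for i, p in enumerate(probs))
    return m4 / m2 ** 2

die = [1 / 6] * 6                 # a single element: one throw of a fair die
dist, trail = die, []
for k in range(1, 11):
    trail.append(beta2(dist))     # beta2 of the sum (equally, of the average) of k throws
    dist = convolve(dist, die)

for k, b in enumerate(trail, start=1):
    print(f"{k:2d} elements: beta2 = {b:.4f}")
```

A single throw is very far from normal in this respect, yet the compound of ten throws already comes within about 0.13 of the normal value.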

But although the normal curve does serve to describe a
considerable number of frequency distributions within reasonable limits,
there are many more cases in which it fails: for example, the
greater part of those bearing on economic matters; also statistics
relating to the incidence of disease and degree of fertility are, as
a rule, very markedly skew. Hence arose the necessity for an
extension from the symmetrical normal to some kind of skew
variation curve to fit such distributions.

The normal curve, however, has an importance of its own to
which we must now draw special attention. It is the foundation
of the theory of errors and provides us with an invaluable method
of estimating the importance of one error in comparison with
another, or of determining the probability that an error shall lie
between stated limits. Upon it we depend for several most
important approximations which are in constant use.

The term 'error' is used here in the sense that if we take the
mean of a number of observations, the deviation of any one of
them from the mean may be termed its error. When such
deviations can be satisfactorily fitted, that is, within the limits of random
sampling, by means of a normal curve, they are said to be subject
to the normal law of error.
This law is expressed, as we have seen, by the equation

$$y = \frac{N}{\sqrt{2\pi}\,\sigma}\,e^{-x^{2}/2\sigma^{2}},$$

where $y\,\delta x$ measures the frequency with which an observed
organ or character deviates from
the mean by an amount lying between x and $(x+\delta x)$ in a large
population, i.e. $y\,\delta x$ registers the frequency of an error of size x
to $(x+\delta x)$, and N and $\sigma$ are constants dependent upon the particular
application of the law.

The probability curve or normal curve of error. As a guide to the
drawing of the above curve it may be worth while plotting

$$y = e^{-x^{2}}.$$

This is readily done by writing the equation in the form

$$-x^{2} = \log_e y.$$

Giving now to y the values 0, 0.1, 0.2, etc., we can find values of
$\log_e y$ shown in Table (48), and, by means of a square root table,
x is then determined.

Table (48). Corresponding Values of x and y to plot $y=e^{-x^{2}}$.

   y      log_e y       x           y      log_e y       x
   0        -∞         ±∞          0.6     -0.5108     ±0.71
  0.1     -2.3026     ±1.52        0.7     -0.3567     ±0.60
  0.2     -1.6094     ±1.27        0.8     -0.2232     ±0.47
  0.3     -1.2040     ±1.10        0.9     -0.1054     ±0.32
  0.4     -0.9163     ±0.96        1.0        0           0
  0.5     -0.6932     ±0.83         ..        ..          ..

This enables us to plot the graph as shown in fig. (40). Since
$\log_e 1=0$, and the logarithm of any number greater than 1 is
positive and thus cannot be equal to $-x^2$, it follows that y cannot be
greater than 1. Moreover y cannot be less than 0, for the logarithm
of a negative quantity is meaningless, but, as y approaches 0,
x approaches ∞.

Also the curve is symmetrical about OY because for any possible
value of y there are two values of x, equal and opposite.
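
The plotting table can be regenerated in a few lines, the natural logarithm and square root functions taking the place of the printed tables. This is a sketch for checking purposes only; values computed this way may differ from the printed ones by a unit in the last decimal place.

```python
from math import log, sqrt

rows = []
for tenths in range(1, 11):
    y = tenths / 10
    ln_y = log(y)                     # log_e y, never positive for y <= 1
    x = sqrt(max(0.0, -ln_y))         # from -x**2 = log_e y; both +x and -x belong to the curve
    rows.append((y, round(ln_y, 4), round(x, 2)))

for y, ln_y, x in rows:
    print(f"y = {y:.1f}   log_e y = {ln_y:8.4f}   x = ±{x:.2f}")
```

The case y = 0 is left out of the loop because its logarithm is infinite, which corresponds to the asymptotic approach of the curve to the axis of x.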

Returning now to the curve

$$y = \frac{N}{\sqrt{2\pi}\,\sigma}\,e^{-x^{2}/2\sigma^{2}},$$

it must be of the same general shape as $y=e^{-x^{2}}$ because the two
only differ in their constants. It is clearly symmetrical, for
instance, about the axis of y, because, in this case also, to any value
of y there are two values of x equal and opposite. Moreover it
tails off to the right and left from OY, the axis of x being an
asymptote, for as x tends to ±∞, y tends to zero as before.

When

$$x=0, \quad y=N/\sqrt{2\pi}\,\sigma,$$

giving the point B, fig. (41), where the curve cuts the axis of y.
This is evidently the highest point on the curve, for

$$\frac{dy}{dx} = -\frac{N x}{\sqrt{2\pi}\,\sigma^{3}}\,e^{-x^{2}/2\sigma^{2}},$$

and this vanishes when x=0.

Again,

$$\frac{d^{2}y}{dx^{2}} = -\frac{N}{\sqrt{2\pi}\,\sigma^{3}}\,e^{-x^{2}/2\sigma^{2}}\left(1-\frac{x^{2}}{\sigma^{2}}\right),$$

which vanishes when $x=\pm\sigma$, and at these two points, H, H', we
therefore have 'points of inflexion' where the bend of the curve
changes its direction.

The axis of y about which there is symmetry evidently locates
the mean error, in this case zero; in fact the mean and mode
coincide, so that the mean or zero error is also the one which most
frequently occurs, and any two other errors which are equal in
magnitude but above and below the mean respectively occur with
equal frequency: i.e. the frequency of positive errors is balanced
by the equal frequency of negative errors on the other side of the
mean, making the median error likewise zero.


Again, the area $\int_{x_1}^{x_2} y\,dx$ measures the frequency of errors lying
between $x_1$ and $x_2$ above the mean; $\int_{-x}^{+x} y\,dx$ registers the frequency
of errors between 0 and x, or of deviations up to this magnitude
on either side of the mean; and, in particular, for all errors

Fig. (41).

$$\text{the total frequency} = \int_{-\infty}^{+\infty} y\,dx = \frac{N}{\sqrt{2\pi}\,\sigma}\int_{-\infty}^{+\infty} e^{-x^2/2\sigma^2}\,dx = \frac{N}{\sqrt{2\pi}\,\sigma}\left(\sqrt{2\pi}\,\sigma\right) \text{ (as on p. 206)} = N.$$

This enables us, by means of the fundamental definition, at once
to write down the probability of errors between any stated limits
and explains the origin of the name, the probability curve, which
is sometimes given to the equation. Thus we have the probability
of an error between $+x_1$ and $+x_2$

$$= \frac{\text{frequency of errors between the given limits}}{\text{frequency of all errors}} = \frac{1}{N}\int_{x_1}^{x_2} y\,dx = \frac{1}{\sqrt{2\pi}\,\sigma}\int_{x_1}^{x_2} e^{-x^2/2\sigma^2}\,dx; \qquad (1)$$

incidentally, the probability of an error between x and $(x+\delta x)$

$$= \frac{y\,\delta x}{N} = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-x^2/2\sigma^2}\,\delta x. \qquad (2)$$

Fig. (42).

Geometrically, the area represented by the shaded portion of
fig. (42) measures the frequency of errors between $+x_1$ and $+x_2$,
while the complete area between the curve and axis X'OX measures
the total frequency, so that the probability of an error between
$+x_1$ and $+x_2$ is measured by the proportion which the area of the
shaded portion bears to the whole area.

If in the above expression (1) we put $x/\sigma = \xi$, so that $dx = \sigma\,d\xi$,
it becomes

$$\frac{1}{\sqrt{2\pi}}\int_{\xi_1}^{\xi_2} e^{-\xi^2/2}\,d\xi, \qquad (3)$$

which is known as the probability integral, $\xi_1$ and $\xi_2$ being the
values of $\xi$ which correspond to the values $x_1$ and $x_2$ of x. But
this integral measures the area of the shaded portion of the curve

$$y = \frac{1}{\sqrt{2\pi}}\, e^{-\xi^2/2}$$

shown in fig. (43), which is really the normal curve over again, but
drawn on a different scale, namely, with the ordinates reduced in
the ratio N : $\sigma$ and with the standard deviation $\sigma$ taken as the
unit of measurement for x, for $\xi = 1, 2, 3, \ldots$ when $x = \sigma, 2\sigma, 3\sigma, \ldots$
This has the effect of making the total area unity and the area
given by

$$\frac{1}{\sqrt{2\pi}}\int_{\xi_1}^{\xi_2} e^{-\xi^2/2}\,d\xi \qquad (3)\ \text{bis}$$

now directly measures the probability of an error between $\xi_1$ and
$\xi_2$. Tables have been prepared (see pp. 284, 285) which enable
us to write down the value of this integral for different values
of $\xi_1$ and $\xi_2$ between certain limits (see Appendix, Note 10).

Fig. (43).

Let us take an example to show how the curve may be
used, and we choose one leading to a binomial distribution, so
giving an expression for the probability by first principles,
in order to compare the two methods.

Example. Suppose we toss simultaneously 100 coins, and suppose
the chance of success, say 'heads,' is the same for each coin
and equal to 1/2. In that case, according to the binomial theory,

the probability of 100 heads $= (1/2)^{100}$,

the probability of 99 heads and 1 tail $= {}^{100}C_1\,(1/2)^{99}(1/2)$,

the probability of 98 heads and 2 tails $= {}^{100}C_2\,(1/2)^{98}(1/2)^2$, and so on.

The most probable number of heads $= np = (100)(1/2) = 50$. This
does not mean, as explained before, that if we perform the
experiment once we are sure on that one occasion to get exactly
50 heads and 50 tails, but that if we go on repeating the experiment
we shall in the long run get 50 heads and 50 tails turning up more
often than any other combination.

Let it be required to find the probability of getting at least 55
heads, that is, we want the probability of getting 55 heads or
more, and this is given by

$$\frac{1}{2^{100}}\left[{}^{100}C_{45} + {}^{100}C_{44} + \cdots + {}^{100}C_{1} + 1\right],$$

a sum not very readily calculated if we have to go at it in a
straightforward manner.

Now let us turn to the curve of error method. The standard
deviation for the distribution is given by

$$\sigma = \sqrt{npq} = \sqrt{100 \times \tfrac{1}{2} \times \tfrac{1}{2}} = 5.$$

Since the mean number of heads to be expected if the experiment
is repeated a considerable number of times = 50, we want to find
the probability of an error equal to or greater than 5, i.e. an error
lying between $\sigma$ and $+\infty$, because $\sigma = 5$.

But the probability of an error between $\xi_1$ and $\xi_2$

$$= \frac{1}{\sqrt{2\pi}}\int_{\xi_1}^{\xi_2} e^{-\xi^2/2}\,d\xi, \quad \text{by (3) bis.}$$

Hence the required probability

$$= \frac{1}{\sqrt{2\pi}}\int_{1}^{\infty} e^{-\xi^2/2}\,d\xi = 0.15866, \text{ by the probability integral tables.}$$

In other words, if we repeated the experiment 100 times, we might
expect 55 or more heads about 16 times.
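
The two methods of the example can be set side by side numerically. The sketch below is an illustration only, written in Python with names (`exact_tail`, `normal_tail`) that are assumptions and not part of the text. It sums the binomial terms directly and then evaluates the area of the unit normal curve from $\xi = 1$ to infinity. The last line adds a variant that is not in the text: starting the area half a unit earlier, at 54.5 heads, which allows for the fact that the number of heads moves by whole units.

```python
import math

def exact_tail(n, k, p=0.5):
    """P(at least k successes in n trials), summed term by term."""
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def normal_tail(xi):
    """Area of the unit normal curve from xi to +infinity."""
    return 0.5 * math.erfc(xi / math.sqrt(2))

n, p = 100, 0.5
mean = n * p
sigma = math.sqrt(n * p * (1 - p))             # 5, as in the text

exact = exact_tail(n, 55)                      # the direct binomial sum
approx = normal_tail((55 - mean) / sigma)      # the text's limits: sigma to infinity
# Assumption, not in the text: begin the area at 54.5 instead of 55.
adjusted = normal_tail((54.5 - mean) / sigma)

if __name__ == "__main__":
    print(f"direct sum      {exact:.5f}")
    print(f"normal curve    {approx:.5f}")
    print(f"half-unit shift {adjusted:.5f}")
```

The direct sum comes out somewhat larger than 0.15866, because the curve treats the number of heads as continuous; the half-unit shift brings the two much closer together.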

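We can now show that if $X_1$, $X_2$ are two uncorrelated variables
obeying the normal law, then $(w_1X_1 + w_2X_2)$ will obey the same law.

Suppose $x_1$, $x_2$ are observed deviations from the mean values
$\bar{X}_1$, $\bar{X}_2$ in one particular record, $\sigma_1$, $\sigma_2$ being the respective S.D.'s.

Let $X = w_1X_1 + w_2X_2$, and let x be the deviation in X
corresponding to deviations $x_1$, $x_2$ in the given variables.

Thus

$$\bar{X} + x = w_1(\bar{X}_1 + x_1) + w_2(\bar{X}_2 + x_2) = (w_1\bar{X}_1 + w_2\bar{X}_2) + (w_1x_1 + w_2x_2).$$

Therefore, $x = w_1x_1 + w_2x_2$.

But the same error x may be obtained by giving $x_1$, $x_2$ many different
values provided their weighted sum is unaltered. Let us first
keep $x_1$ constant, so that the corresponding value of $x_2$ required
to produce an error lying between x and $(x+\delta x)$, where $\delta x$ is small,
must be such that

$$x < w_1x_1 + w_2x_2 < x + \delta x,$$

i.e. $x - w_1x_1 < w_2x_2 < x - w_1x_1 + \delta x$,

i.e. $x_2$ lies between $(x - w_1x_1)/w_2$ and $(x - w_1x_1)/w_2 + \delta x/w_2$, and the
probability for this

$$= \frac{1}{\sqrt{2\pi}\,\sigma_2}\, e^{-(x - w_1x_1)^2/2\sigma_2^2w_2^2}\,\frac{\delta x}{w_2}, \quad \text{by (2).}$$

Now this is in a form which only involves $\delta x$, x, and $x_1$, and we
get the total probability for an error lying between x and $(x+\delta x)$
by giving all possible values to the error $x_1$.

But the probability for $x_1$ itself to lie between $x_1$ and $(x_1+\delta x_1)$

$$= \frac{1}{\sqrt{2\pi}\,\sigma_1}\, e^{-x_1^2/2\sigma_1^2}\,\delta x_1, \quad \text{by (2),}$$

and the probability for this to concur with a suitable $x_2$ to produce
an error in the weighted sum lying between x and $(x+\delta x)$, on the
assumption that $x_1$ and $x_2$ are independent, is therefore

$$\frac{1}{\sigma_1\sqrt{2\pi}}\, e^{-x_1^2/2\sigma_1^2}\,\delta x_1 \cdot \frac{1}{w_2\sigma_2\sqrt{2\pi}}\, e^{-(x - w_1x_1)^2/2\sigma_2^2w_2^2}\,\delta x.$$

Hence the total probability for an error lying between x and $(x+\delta x)$
is obtained by integrating this result, that is, summing all possible
probabilities, between $x_1 = -\infty$ and $x_1 = +\infty$. This gives

$$\frac{\delta x}{w_2 \cdot 2\pi\sigma_1\sigma_2}\int_{-\infty}^{+\infty} e^{-\frac{x_1^2}{2\sigma_1^2} - \frac{(x - w_1x_1)^2}{2\sigma_2^2w_2^2}}\,dx_1$$

$$= \frac{\delta x}{w_2 \cdot 2\pi\sigma_1\sigma_2}\int_{-\infty}^{+\infty} e^{-\frac{1}{2\sigma_1^2\sigma_2^2w_2^2}\left(\sigma^2x_1^2 - 2w_1\sigma_1^2\,x\,x_1 + \sigma_1^2x^2\right)}\,dx_1,$$

where $\sigma^2 = w_1^2\sigma_1^2 + w_2^2\sigma_2^2$,

$$= \frac{\delta x}{w_2 \cdot 2\pi\sigma_1\sigma_2}\, e^{-x^2/2\sigma^2}\int_{-\infty}^{+\infty} e^{-\frac{\sigma^2}{2\sigma_1^2\sigma_2^2w_2^2}\left(x_1 - \frac{w_1\sigma_1^2}{\sigma^2}x\right)^2}\,dx_1$$

$$= \frac{\delta x}{w_2 \cdot 2\pi\sigma_1\sigma_2}\, e^{-x^2/2\sigma^2} \cdot \frac{\sqrt{2}\,\sigma_1\sigma_2w_2}{\sigma}\int_{-\infty}^{+\infty} e^{-t^2}\,dt,$$

where

$$t = \frac{1}{\sqrt{2}}\left(\frac{\sigma\,x_1}{\sigma_1\sigma_2w_2} - \frac{w_1\sigma_1\,x}{\sigma\,\sigma_2w_2}\right),$$

$$= \frac{\delta x}{w_2 \cdot 2\pi\sigma_1\sigma_2}\, e^{-x^2/2\sigma^2} \cdot \sqrt{\pi} \cdot \frac{\sqrt{2}\,\sigma_1\sigma_2w_2}{\sigma}$$

$$= \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-x^2/2\sigma^2}\,\delta x,$$

which proves that the error x obeys the normal law with

$$\text{S.D.} = \sqrt{w_1^2\sigma_1^2 + w_2^2\sigma_2^2}. \qquad (5)$$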

The above principle is readily extended, for if

$$X = w_1X_1 + w_2X_2 + \cdots + w_nX_n,$$

$X_1, X_2, \ldots X_n$ being independent variables obeying the normal
law, then X also obeys the normal law and its

$$\text{S.D.} = \sqrt{w_1^2\sigma_1^2 + w_2^2\sigma_2^2 + \cdots + w_n^2\sigma_n^2}. \qquad (6)$$
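
The rule for the S.D. of a weighted sum can be checked against a simulated record. The sketch below is illustrative only: the function names, the weights, the standard deviations, the number of trials and the seed are all assumptions chosen for the demonstration, not values from the text. It draws independent normal deviations, forms their weighted sum, and compares the observed S.D. with $\sqrt{w_1^2\sigma_1^2 + \cdots + w_n^2\sigma_n^2}$.

```python
import math
import random

def sd_of_weighted_sum(weights, sigmas):
    """S.D. of w1*X1 + ... + wn*Xn for independent variables, by the rule above."""
    return math.sqrt(sum((w * s) ** 2 for w, s in zip(weights, sigmas)))

def simulated_sd(weights, sigmas, trials=20000, seed=1):
    """Observed S.D. of the weighted sum over a simulated record of normal deviations."""
    rng = random.Random(seed)
    totals = [
        sum(w * rng.gauss(0.0, s) for w, s in zip(weights, sigmas))
        for _ in range(trials)
    ]
    mean = sum(totals) / trials
    return math.sqrt(sum((t - mean) ** 2 for t in totals) / trials)

if __name__ == "__main__":
    weights = (2.0, -1.0, 0.5)      # hypothetical weights
    sigmas = (1.0, 3.0, 4.0)        # hypothetical standard deviations
    print(sd_of_weighted_sum(weights, sigmas))   # sqrt(4 + 9 + 4)
    print(simulated_sd(weights, sigmas))
```

A simulation of this kind only illustrates the S.D.; it does not replace the proof that the sum itself follows the normal law.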

In discussing the results of random sampling we worked upon
the principle that, given a number of sample observations of any
statistical constant, a mean or a percentage or a coefficient of
regression or anything else, an error or deviation as large as $\sigma$,
the standard deviation, from the true value for the whole population
might quite likely occur, but that an error exceeding $3\sigma$ would be
unlikely, and we explained that, as a result of convention, the
probable error, equal to $\tfrac{2}{3}\sigma$ roughly, was largely used in place of $\sigma$
by many writers. We have now to examine the basis of this
principle, and the first point to notice is that it only strictly applies
to a normal distribution.


To find the probability of an error lying between $-\sigma$ and $+\sigma$ in a
normal distribution.

The required probability

$$= \frac{1}{\sqrt{2\pi}\,\sigma}\int_{-\sigma}^{+\sigma} e^{-x^2/2\sigma^2}\,dx = \frac{1}{\sqrt{2\pi}}\int_{-1}^{+1} e^{-\xi^2/2}\,d\xi \quad (\text{where } x = \sigma\xi)$$

$$= \frac{2}{\sqrt{2\pi}}\int_{0}^{1} e^{-\xi^2/2}\,d\xi = 0.6827, \text{ by means of the tables.}$$

This then is the probability that the error in a given sample shall
not exceed the S.D., $\sigma$. The probability that the error shall exceed
$\sigma$ is accordingly (1 - 0.68) = 0.32. It therefore appears that the
odds against an error exceeding this amount are 68 to 32, or about
2 to 1.

The probability of an error between $-2\sigma$ and $+2\sigma$

$$= \frac{1}{\sqrt{2\pi}}\int_{-2}^{+2} e^{-\xi^2/2}\,d\xi = 0.9545,$$

and the probability of an error outside these limits = 0.0455.

Hence the odds against an error exceeding $2\sigma$ are about 21 to 1.

The probability of an error between $-3\sigma$ and $+3\sigma$

$$= \frac{1}{\sqrt{2\pi}}\int_{-3}^{+3} e^{-\xi^2/2}\,d\xi = \frac{2}{\sqrt{2\pi}}\int_{0}^{3} e^{-\xi^2/2}\,d\xi = 0.9973.$$

Hence the odds against an error exceeding $3\sigma$ are about 370 to 1.
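
The three tabular values can be reproduced without tables. The sketch below is an illustration in Python and not part of the text; it relies on the standard identity that the area of the unit normal curve between $-k$ and $+k$ equals $\operatorname{erf}(k/\sqrt{2})$, and the function name `prob_within` is an assumption.

```python
import math

def prob_within(k):
    """Probability that a normal error lies between -k*sigma and +k*sigma."""
    return math.erf(k / math.sqrt(2))

def odds_against_exceeding(k):
    """Odds (to 1) against an error exceeding k*sigma in either direction."""
    inside = prob_within(k)
    return inside / (1 - inside)

if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, round(prob_within(k), 4), round(odds_against_exceeding(k), 1))
```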



That these results are reasonable can be seen by an examination
of the curve of error

$$y = \frac{N}{\sqrt{2\pi}\,\sigma}\, e^{-x^2/2\sigma^2},$$

the graph of which is drawn, fig. (44), in the particular case when
$\sigma = 5$, N = 100. The maximum ordinate is thus $= 20/\sqrt{2\pi} = 7.98$,
and the curve becomes

$$y = 7.98\, e^{-x^2/50}.$$

When $x = \sigma = 5$, y = (7.98)(0.606) = 4.84 = $P_1N_1$ in the figure,
when $x = 2\sigma = 10$, y = (7.98)(0.135) = 1.08 = $P_2N_2$ in the figure,
when $x = 3\sigma = 15$, y = (7.98)(0.011) = 0.09 = $P_3N_3$ in the figure.

There is a point of inflexion where the curve changes its
direction at $P_1$, also at the companion point $P'_1$, on the other side
of OB.

The areas $ON_1P_1B$, $ON_2P_2B$, $ON_3P_3B$, $P_3N_3X$ represent respectively
the frequencies of errors 0 to $\sigma$, 0 to $2\sigma$, 0 to $3\sigma$, $3\sigma$ and over
(considering only errors on the positive side, that is, deviations
above the mean), and the figure shows how very improbable is a
deviation from the mean exceeding $3\sigma$, for the area between the
curve and axis beyond this limit is negligible. Put in another
way, a range of $6\sigma$ should include practically all the observations
in the sample.

The probable error has in the past received various names, such
as mean error, median error, quartile deviation, and although some
of these may seem more applicable and less confusing than the 
name to which it has settled down, there is perhaps not sufficient 
excuse for unsettling it again, even had we the power to do so, 
by attempting a return to one of these old names. 

If its magnitude be r it is defined to be such that the chance 
of an error falling within the limits —r and +r is exactly equal to 
the chance of an error falling outside these limits, in fact it is an 
even chance whether a particular error falls within these limits 
or not. 

Since area measures frequency it follows that the ordinates 
drawn through the probable errors divide both halves of the normal 
curve (above and below the mean) into two equal parts ; the one 
above the mean, QR, is shown in fig. (44), and consequently the 
area OBQR=the area QRX, in that figure. These ordinates there¬ 
fore coincide with the quartiles, and the probable error is precisely 
the same measure as the quartile deviation. 

The magnitude of the error is readily calculated from the
probability integral table, for, by definition, we have

$$\frac{1}{\sqrt{2\pi}\,\sigma}\int_{-r}^{+r} e^{-x^2/2\sigma^2}\,dx = \frac{1}{2}.$$

Hence

$$\frac{1}{\sqrt{2\pi}}\int_{-r/\sigma}^{+r/\sigma} e^{-\xi^2/2}\,d\xi = \frac{1}{2} \quad (\text{where } x = \sigma\xi),$$

and the probability integral table at once gives

$$r = 0.6745\,\sigma = \text{approximately } \tfrac{2}{3}\sigma.$$

Thus we have the frequently quoted rule that the
quartile deviation = $\tfrac{2}{3}$ (standard deviation),
or probable error = 0.6745 (S.D.)

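The probability of an error lying between $-3r$ and $+3r$

$$= \frac{1}{\sqrt{2\pi}\,\sigma}\int_{-3r}^{+3r} e^{-x^2/2\sigma^2}\,dx = \frac{1}{\sqrt{2\pi}}\int_{-3(0.6745)}^{+3(0.6745)} e^{-\xi^2/2}\,d\xi \quad (\text{where } x = \sigma\xi \text{ as before})$$

$$= 0.9570.$$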


Thus the odds against a deviation exceeding three times the probable 
error occurring in a single trial are about 22 to 1, or much the same 
as the odds against a deviation exceeding twice the S.D. 
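
In place of the table, the ratio $r/\sigma$ may be found by trial. The sketch below is an illustration in Python, not part of the text, with assumed names `prob_within` and `probable_error_ratio`: it halves an interval repeatedly until the area between $-\xi$ and $+\xi$ is one half, and then evaluates the probability of an error within three times the probable error.

```python
import math

def prob_within(xi):
    """Area of the unit normal curve between -xi and +xi."""
    return math.erf(xi / math.sqrt(2))

def probable_error_ratio(tol=1e-10):
    """Solve prob_within(xi) = 1/2 by bisection; the root is r / sigma."""
    lo, hi = 0.0, 1.0            # prob_within(0) = 0 and prob_within(1) > 1/2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if prob_within(mid) < 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

ratio = probable_error_ratio()          # close to 0.6745
within_3r = prob_within(3 * ratio)      # close to 0.957

if __name__ == "__main__":
    print(round(ratio, 4), round(within_3r, 3))
    print("odds against exceeding 3r:", round(within_3r / (1 - within_3r), 1))
```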

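There remains one other standard of measurement in connection
with errors which is at least deserving of mention, namely, what
we have previously called the mean deviation, which may be denoted
by $\eta$. It is simply the mean of all errors without regard to sign;
thus, since $y\,\delta x$ measures the frequency of an error lying between
x and $(x+\delta x)$,

$$\eta = 2\int_0^{\infty} x\,y\,dx \Big/ 2\int_0^{\infty} y\,dx$$

$$= \int_0^{\infty} x\,e^{-x^2/2\sigma^2}\,dx \Big/ \int_0^{\infty} e^{-x^2/2\sigma^2}\,dx$$

$$= \sigma\int_0^{\infty} \xi\,e^{-\xi^2/2}\,d\xi \Big/ \int_0^{\infty} e^{-\xi^2/2}\,d\xi \quad (\text{where } x = \sigma\xi)$$

$$= \sqrt{2}\,\sigma\int_0^{\infty} t\,e^{-t^2}\,dt \Big/ \int_0^{\infty} e^{-t^2}\,dt \quad (\text{where } \xi = \sqrt{2}\,t)$$

$$= \sqrt{2}\,\sigma \cdot \frac{1}{2} \Big/ \frac{\sqrt{\pi}}{2} = \sqrt{\frac{2}{\pi}}\,\sigma = 0.7979\,\sigma,$$

hence the rough rule that the

mean deviation = $\tfrac{4}{5}$ (standard deviation) . . . (8)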
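
The ratio of the mean deviation to the S.D. can also be confirmed by direct summation. The sketch below is illustrative only; the function name, the number of steps and the upper limit of 12 are assumptions chosen so that the neglected tail is immaterial. It approximates the two integrals for the unit normal curve by the midpoint rule and takes their quotient.

```python
import math

def mean_deviation_ratio(steps=20000, upper=12.0):
    """Midpoint-rule estimate of (mean deviation) / sigma for the normal curve."""
    h = upper / steps
    numerator = 0.0      # approximates the integral of xi * exp(-xi^2 / 2)
    denominator = 0.0    # approximates the integral of exp(-xi^2 / 2)
    for i in range(steps):
        xi = (i + 0.5) * h
        weight = math.exp(-xi * xi / 2)
        numerator += xi * weight
        denominator += weight
    return numerator / denominator   # the step h cancels in the quotient

if __name__ == "__main__":
    print(round(mean_deviation_ratio(), 4))
    print(round(math.sqrt(2 / math.pi), 4))
```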

It must be borne in mind that all the above rules relating to
errors, using the term as synonymous with the deviations of single
or sample observations from the mean of a considerable number of
the same character, strictly apply, as we said before, to the normal
curve of error and are only approximately true for other distributions,
the approximation being the closer the nearer they approach
to the normal form and the larger the number of observations
involved. They have been tested in some cases in earlier chapters
(see, for example, Chapter VII.), and the results obtained, even
with very skew distributions of comparatively small numbers of
observations, are at all events close enough to suggest the utility
of the rules in more favourable cases.

The effect of variability on errors. The probability of an error
lying between 0 and t

$$= \frac{1}{\sqrt{2\pi}\,\sigma}\int_0^{t} e^{-x^2/2\sigma^2}\,dx.$$

Put $x = x'/m$, and this becomes

$$\frac{1}{\sqrt{2\pi}\,\sigma}\int_0^{mt} e^{-x'^2/2m^2\sigma^2}\,\frac{dx'}{m},$$

which is the probability of an error lying between 0 and mt in a
distribution whose standard deviation is $m\sigma$.

Fig. (45).

Thus, if the variability be increased m-fold the range of error (of
equal probability) is increased m-fold, so that if we have two sets
of N observations, with the variability of one set double that of
the other, the range of error also in the one set is double that which
is equally likely to occur in the other. This is brought out fairly
clearly in fig. (45), which is the result of plotting the curve

$$y = \frac{N}{\sqrt{2\pi}\,\sigma}\, e^{-x^2/2\sigma^2}$$

in the two cases. The variability $\sigma$ of curve (1) is double that
of curve (2); if then we measure along OX in the figure

$$ON_1 = 2\,ON_2 = 2t,$$

the area $B_1ON_1P_1$ will be equal to the area $B_2ON_2P_2$, showing that
the probability for an error between 0 and 2t in the one case is equal
to the probability for an error between 0 and t in the other case.
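
The equality of the two areas can be verified for any chosen figures. In the sketch below, which is an illustration and not part of the text, the function name and the values of t and $\sigma$ are assumptions; the probability of an error between 0 and t is computed for one standard deviation and compared with that of an error between 0 and mt when the standard deviation is m times as great.

```python
import math

def prob_zero_to(t, sigma):
    """Probability of a normal error between 0 and t when the S.D. is sigma."""
    return 0.5 * math.erf(t / (sigma * math.sqrt(2)))

if __name__ == "__main__":
    t, sigma = 1.5, 2.0                          # hypothetical values
    for m in (2, 3):
        print(m, prob_zero_to(t, sigma), prob_zero_to(m * t, m * sigma))
```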


[James Bernoulli (1654-1705), the eldest of three remarkable brothers,
showed how the binomial theorem could be used to estimate the probability
that the ratio of the number of successes to the number of failures under
defined conditions should lie between set limits, where success means that a
certain event happens and failure means that it fails to happen.

It was Gauss who first actually published a proof (1809) of the equation of the
normal curve, although Laplace had suggested as early as 1783 the utility
of a probability integral table, $\int e^{-t^2}\,dt$. Gauss's proof depended upon certain
axioms which cannot be established and are not necessarily true, one of which
was that 'errors above and below the mean are equally probable.' Laplace
and Poisson improved upon Gauss and succeeded without assuming this
axiom, but with the aid of theorems due to Euler and Stirling, in developing
the continuous probability integral from the discontinuous binomial series.

Further extensions of the normal curve applicable to skew distributions
have been worked out by other writers, such as Galton and McAlister, Fechner,
Lipps, Werner, Charlier, Kapteyn, and finally by Edgeworth, who has
contributed materially to the development of the idea of 'the Law of Great
Numbers.' Karl Pearson approaching the subject of skew variation from
the same point but by an original route, has discovered a complete system of
curves suitable for fitting almost all kinds of distributions in homogeneous
material, especially such as are met with in the biological world.

(See Todhunter, History of Probability.

Edgeworth, Law of Error in the Encyclopaedia Britannica (10th edition).

Pearson, Das Fehlergesetz und seine Verallgemeinerungen durch Fechner
und Pearson: A Rejoinder; Biometrika, vol. iv., pp. 169-212).]



CHAPTER XIX


FREQUENCY SURFACE FOR TWO CORRELATED VARIABLES 

It may serve at this stage to widen the outlook upon the subject
of correlation for those who are able to follow it up on mathematical
lines if we briefly consider the algebraical expression for
the combined distribution of two variables.

Let the variables be $X_1$, $X_2$. They may be absolutely independent
or they may be related in some way, but in either case we shall
assume it possible to set up a one-to-one correspondence between
them: thus, $X_1$ might represent the marriage rate and $X_2$ the
index number for wholesale prices, and we might always pair
together the $X_1$ and the $X_2$ which refer to the same year, as in the
correlation example in a previous chapter; moreover this pairing
might still be effected even if there were really no other connection
at all between $X_1$ and $X_2$.

If then $x_1$, $x_2$ typify the deviations of $X_1$, $X_2$ from their respective
means (the means in the above case being derived by averaging
the figures for a number of years), it is possible to write down an
expression of the form

$$y = F(x_1, x_2)$$

for determining the probability of deviations between $x_1$ and
$(x_1+\delta x_1)$, $x_2$ and $(x_2+\delta x_2)$, occurring simultaneously (in the same
year, in the above case); or, to put the same thing in another way,
$y\,\delta x_1\,\delta x_2$ would represent the proportional frequency with which
such deviations might be expected to occur together in a large
number of observations.

The frequency curve $y = f(x)$, where $y\,\delta x$ denotes the frequency
with which a variable with deviation lying between x and $(x+\delta x)$
from its mean value is observed in a given distribution, was
represented by plotting corresponding pairs of values of x and y as
points in a plane. In the expression $y = F(x_1, x_2)$, however, we have
three variables to consider, $x_1$ and $x_2$, and y which measures the
frequency of the simultaneous appearance of $x_1$ and $x_2$. Such a
trio may geometrically be represented by a point P $(x_1, x_2, y)$ in
space of three dimensions, for $(x_1, x_2)$ can first be located as a point
in a fixed plane and a height y may then be measured above this
plane as in fig. (46). Clearly as $x_1$ and $x_2$ vary, y also varies, and
consequently the point P moves about in space, but it moves always
in obedience to the relation

$$y = F(x_1, x_2).$$

Fig. (46).

This relation is called the equation of the surface along which
P travels, showing that it holds good for
the co-ordinates $(x_1, x_2, y)$ of any position
which the point can take up on that surface.
It is convenient, however, to use the notation

$$z = F(x, y)$$

in preference to $y = F(x_1, x_2)$ for the 'frequency
surface,' because OX, OY are nearly
always taken to represent the axes of reference
in space of two dimensions (i.e. in a plane), and by a natural
extension OX, OY, OZ are taken to represent the axes of reference
in space of three dimensions, fig. (47).

We proceed to discuss the frequency surface for two variables, 
and we shall start with the comparatively simple case when the 
variables are completely independent. 

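Frequency surface showing distribution of two completely independent
variables each subject to the normal law.

Let X, Y be the variables, and let x, y denote deviations from
their means $\bar{X}$, $\bar{Y}$, the point $(\bar{X}, \bar{Y})$ being taken as origin of
co-ordinates and the usual notation being adopted.

Fig. (47).

Thus the probability of a deviation between x and $(x+\delta x)$ occurring

$$= \frac{1}{\sqrt{2\pi}\,\sigma_x}\, e^{-x^2/2\sigma_x^2}\,\delta x,$$

and the probability of a deviation between y and $(y+\delta y)$ occurring

$$= \frac{1}{\sqrt{2\pi}\,\sigma_y}\, e^{-y^2/2\sigma_y^2}\,\delta y.$$

Therefore the probability of such deviations occurring together,
since the variables are supposed completely independent,

$$= \left(\frac{1}{\sqrt{2\pi}\,\sigma_x}\, e^{-x^2/2\sigma_x^2}\,\delta x\right)\left(\frac{1}{\sqrt{2\pi}\,\sigma_y}\, e^{-y^2/2\sigma_y^2}\,\delta y\right) = \frac{1}{2\pi\sigma_x\sigma_y}\, e^{-\frac{1}{2}\left(\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2}\right)}\,\delta x\,\delta y.$$

Hence the frequency with which such pairs of deviations are
observed together if n be the total number of observations

$$= \frac{n}{2\pi\sigma_x\sigma_y}\, e^{-\frac{1}{2}\left(\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2}\right)}\,\delta x\,\delta y.$$

Denoting this by $z\,\delta x\,\delta y$, we get for the required frequency surface

$$z = \frac{n}{2\pi\sigma_x\sigma_y}\, e^{-\frac{1}{2}\left(\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2}\right)}. \qquad (1)$$

If we give y some particular value, $y_1$, we find from the above
equation that the law of frequency for the corresponding x is

$$z = \frac{n}{2\pi\sigma_x\sigma_y}\, e^{-\frac{1}{2}\left(\frac{x^2}{\sigma_x^2} + \frac{y_1^2}{\sigma_y^2}\right)} = \left[\frac{n}{\sqrt{2\pi}\,\sigma_y}\, e^{-y_1^2/2\sigma_y^2}\right]\frac{1}{\sqrt{2\pi}\,\sigma_x}\, e^{-x^2/2\sigma_x^2} = \frac{n_1}{\sqrt{2\pi}\,\sigma_x}\, e^{-x^2/2\sigma_x^2},$$

where $n_1$ has been written in place of

$$\frac{n}{\sqrt{2\pi}\,\sigma_y}\, e^{-y_1^2/2\sigma_y^2}.$$

But this is evidently a normal curve in the plane $y = y_1$, having
the same mean, $\bar{X}$, and the same S.D., $\sigma_x$, whatever be the value
of $y_1$.

Hence all arrays of x are similar, having the same mean and the
same standard deviation, and this, by symmetry, also applies to
all arrays of y.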

Now put z equal to some constant, k, in equation (1), so that

$$\frac{2\pi\sigma_x\sigma_y\,k}{n} = e^{-\frac{1}{2}\left(\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2}\right)}.$$

Since the left-hand side of this equation is constant for different
values of (x, y), it follows that the right-hand side is also constant
and hence

$$\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} = c, \qquad (2)$$

where c is a constant.

We conclude that the values of x and y which can occur together
with a given frequency, k, are such that the point (x, y) always lies
somewhere on the ellipse (2) in the plane z = k, fig. (48); e.g. values
in the neighbourhood of $x_1$ and $y_1$ occur with the same frequency as
values in the neighbourhood of $x_2$ and 0, because in the figure the
points $(x_1, y_1, k)$ and $(x_2, 0, k)$ both lie on the ellipse defined by

$$z = k, \qquad \frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} = c.$$

The different ellipses which can be obtained by varying the
frequency, and consequently varying c, are clearly concentric,
similar, and similarly situated if they are orthogonally projected
on to the plane z = 0, for the effect of such projection is that any
point (x, y, z) drops down on to the point (x, y, 0) which stands
immediately below it in the plane XOY.

The general shape of the surface can be gathered from fig. (48)
where the ellipse in the plane z = k, and the normal curves in the
planes x = 0, y = 0, and $y = y_1$ have been drawn.

It will also be noted that if the scales of x and y are altered by
writing $x/\sigma_x = x'$ and $y/\sigma_y = y'$, so that unit change in each may be the
same, the ellipse (2) becomes a circle

$$x'^2 + y'^2 = c.$$

This change of scales is equivalent geometrically to projecting
orthogonally the ellipse into a circle; of course the planes of
projection are not the same as in the previous orthogonal projection
mentioned.

Frequency surface for two correlated variables. Let the variables
be X and Y, and let us work as before with their deviations x and y,
which is equivalent to taking the mean point $(\bar{X}, \bar{Y})$ of all the
observations as origin.

Now the line of regression giving the best y, or the y of greatest
frequency, corresponding to any x is

$$y = r\frac{\sigma_y}{\sigma_x}x,$$

with the usual notation, r being the coefficient of correlation
between X and Y.

Hence the error made in estimating any y from this equation
instead of taking the y given by observation is

$$\eta = y \text{ (observed)} - y \text{ (estimated)} = y - r\frac{\sigma_y}{\sigma_x}x. \quad [\text{See fig. (49).}]$$

Thus, corresponding to every pair of observations (x, y) there is
an $\eta$, and the same $\eta$ will be repeated
just as often as the same pair of
observations (x, y) is repeated.

Therefore the frequency distribution
of $(x, \eta)$ must exactly correspond
to that of (x, y).

Fig. (49).

Further, the correlation of the variables x and $\eta$ is zero, for
positive and negative errors $\eta$ are equally likely to occur for different
values of x; in fact, this coefficient of correlation is $\Sigma(x\eta)/n\sigma_x\sigma_\eta$, and

$$\Sigma(x\eta) = \Sigma\left[x\left(y - r\frac{\sigma_y}{\sigma_x}x\right)\right] = \Sigma(xy) - r\frac{\sigma_y}{\sigma_x}\Sigma(x^2) = nr\sigma_x\sigma_y - r\frac{\sigma_y}{\sigma_x}\,n\sigma_x^2 = 0.$$

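Assuming then that the variables x and $\eta$ are quite independent,
the probability of them occurring together is readily written down,
for it is simply the product of their separate probabilities.

But the probability of a deviation between x and $(x+\delta x)$ occurring,
if we consider this variable alone, is

$$\frac{1}{\sqrt{2\pi}\,\sigma_x}\, e^{-x^2/2\sigma_x^2}\,\delta x,$$

and the probability of a deviation between $\eta$ and $(\eta+\delta\eta)$ occurring,
if we consider this variable alone, is

$$\frac{1}{\sqrt{2\pi}\,\sigma_\eta}\, e^{-\eta^2/2\sigma_\eta^2}\,\delta\eta.$$

Hence the probability of a combined occurrence of such deviations

$$= \left(\frac{1}{\sqrt{2\pi}\,\sigma_x}\, e^{-x^2/2\sigma_x^2}\,\delta x\right)\left(\frac{1}{\sqrt{2\pi}\,\sigma_\eta}\, e^{-\eta^2/2\sigma_\eta^2}\,\delta\eta\right) = \frac{\delta x\,\delta\eta}{2\pi\sigma_x\sigma_\eta}\, e^{-\frac{1}{2}\left(\frac{x^2}{\sigma_x^2} + \frac{\eta^2}{\sigma_\eta^2}\right)}.$$

But

$$n\sigma_\eta^2 = \Sigma(\eta^2) = \Sigma\left(y - r\frac{\sigma_y}{\sigma_x}x\right)^2 = \Sigma(y^2) - 2r\frac{\sigma_y}{\sigma_x}\Sigma(xy) + r^2\frac{\sigma_y^2}{\sigma_x^2}\Sigma(x^2)$$

$$= n\sigma_y^2 - 2r\frac{\sigma_y}{\sigma_x}\cdot nr\sigma_x\sigma_y + r^2\frac{\sigma_y^2}{\sigma_x^2}\cdot n\sigma_x^2 = n\sigma_y^2(1 - r^2).$$

Similarly,

$$n\sigma_\xi^2 = n\sigma_x^2(1 - r^2),$$

where $\xi$ is the error made in estimating x from $x = r\dfrac{\sigma_x}{\sigma_y}y$.

Also

$$\frac{x^2}{\sigma_x^2} + \frac{\eta^2}{\sigma_\eta^2} = \frac{x^2}{\sigma_x^2} + \frac{\left(y - r\frac{\sigma_y}{\sigma_x}x\right)^2}{\sigma_y^2(1 - r^2)} = \frac{1}{1 - r^2}\left(\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} - \frac{2rxy}{\sigma_x\sigma_y}\right).$$

Thus, since $\delta\eta = \delta y$ for any given x, the probability of deviations
between x and $(x+\delta x)$, y and $(y+\delta y)$, occurring together is

$$\frac{\delta x\,\delta y}{2\pi\sigma_x\sigma_y\sqrt{1 - r^2}}\, e^{-\frac{1}{2(1 - r^2)}\left(\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} - \frac{2rxy}{\sigma_x\sigma_y}\right)},$$

and, if n be the total number of observations, the frequency surface is

$$z = \mu\, e^{-\frac{1}{2(1 - r^2)}\left(\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} - \frac{2rxy}{\sigma_x\sigma_y}\right)}, \qquad (3)$$

where $\mu$ has been written for $n/2\pi\sigma_x\sigma_y\sqrt{1 - r^2}$.

If we give y some particular value, $y_1$, the law of frequency for the
corresponding x is

$$z = \mu\, e^{-y_1^2/2\sigma_y^2}\, e^{-\left(x - r\frac{\sigma_x}{\sigma_y}y_1\right)^2/2\sigma_x^2(1 - r^2)}. \qquad (4)$$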
But just as

$$z = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x - a)^2/2\sigma^2}$$

represents exactly the same normal curve as

$$z = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-x^2/2\sigma^2}$$

shifted through a distance a along the axis of x, fig. (50), so we
conclude that the curve (4) in x and z, in the plane $y = y_1$, is exactly the
same as the normal curve

$$z = \mu\, e^{-y_1^2/2\sigma_y^2}\, e^{-x^2/2\sigma_x^2(1 - r^2)}$$

Fig. (50).

shifted through a distance $r\dfrac{\sigma_x}{\sigma_y}y_1$ along an axis parallel to OX. In fact
(4) represents a normal distribution for x, the mean, corresponding
to greatest frequency when $z = \mu\, e^{-y_1^2/2\sigma_y^2}$, being determined by the
intersection with the surface (3) of the planes

$$y = y_1, \qquad \frac{x}{\sigma_x} = r\frac{y}{\sigma_y},$$

and the standard deviation being $\sigma_x\sqrt{1 - r^2}$, which we note is
independent of $y_1$, fig. (51). To put the same thing in another
way, the array of x's corresponding to a particular value $y_1$ of y
have a mean deviating from $\bar{X}$ by $r\dfrac{\sigma_x}{\sigma_y}y_1$, and a standard deviation
$\sigma_x\sqrt{1 - r^2}$.

In particular, when y = 0, $z = \mu\, e^{-x^2/2\sigma_x^2(1 - r^2)}$, a normal distribution
for x, the mean, corresponding to greatest frequency with $z = \mu$,
being determined by the intersection with the surface (3) of the
planes y = 0, $\dfrac{x}{\sigma_x} = r\dfrac{y}{\sigma_y}$, and the standard deviation being $\sigma_x\sqrt{1 - r^2}$
as before.

Similarly, when $x = x_1$, we get as in (4) a normal distribution for y,

$$z = \mu\, e^{-x_1^2/2\sigma_x^2}\, e^{-\left(y - r\frac{\sigma_y}{\sigma_x}x_1\right)^2/2\sigma_y^2(1 - r^2)},$$

the mean, corresponding to greatest frequency when $z = \mu\, e^{-x_1^2/2\sigma_x^2}$, being
determined by the intersection with the surface (3) of the planes

$$x = x_1, \qquad \frac{y}{\sigma_y} = r\frac{x}{\sigma_x},$$

and the standard deviation being $\sigma_y\sqrt{1 - r^2}$, which is independent
of $x_1$. In other words, the array of y's corresponding to a particular
value $x_1$ of x have a mean deviating from $\bar{Y}$ by $r\dfrac{\sigma_y}{\sigma_x}x_1$ and a standard
deviation $\sigma_y\sqrt{1 - r^2}$.

In particular, when x = 0, $z = \mu\, e^{-y^2/2\sigma_y^2(1 - r^2)}$, a normal distribution
for y, the mean, corresponding to greatest frequency with $z = \mu$,
being determined by the intersection with the surface (3) of the
planes x = 0, $\dfrac{y}{\sigma_y} = r\dfrac{x}{\sigma_x}$, and the standard deviation being $\sigma_y\sqrt{1 - r^2}$.
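
The properties of the arrays can be tested on the surface numerically. The sketch below is an illustration only: it assumes the usual normal surface for two correlated variables, $z = \mu\,e^{-\frac{1}{2(1-r^2)}\left(x^2/\sigma_x^2 + y^2/\sigma_y^2 - 2rxy/\sigma_x\sigma_y\right)}$ with $\mu = n/2\pi\sigma_x\sigma_y\sqrt{1-r^2}$, and the function names and the numerical values of n, $\sigma_x$, $\sigma_y$, r and $y_1$ are hypothetical. It confirms that in the plane $y = y_1$ the ordinate is greatest at $x = r(\sigma_x/\sigma_y)y_1$ and falls to $e^{-1/2}$ of that maximum at a distance $\sigma_x\sqrt{1-r^2}$ on either side, whatever $y_1$ may be.

```python
import math

def surface(x, y, n, sx, sy, r):
    """Ordinate z of the normal frequency surface for correlated deviations x, y."""
    mu = n / (2 * math.pi * sx * sy * math.sqrt(1 - r * r))
    q = (x / sx) ** 2 + (y / sy) ** 2 - 2 * r * x * y / (sx * sy)
    return mu * math.exp(-q / (2 * (1 - r * r)))

def array_of_x(y1, sx, sy, r):
    """Mean and S.D. of the array of x's standing on the line y = y1."""
    return r * sx / sy * y1, sx * math.sqrt(1 - r * r)

if __name__ == "__main__":
    n, sx, sy, r = 1000, 2.0, 3.0, 0.6      # hypothetical constants
    for y1 in (0.0, 1.5, -3.0):
        mean, sd = array_of_x(y1, sx, sy, r)
        peak = surface(mean, y1, n, sx, sy, r)
        ratio = surface(mean + sd, y1, n, sx, sy, r) / peak
        print(y1, round(mean, 3), round(sd, 3), round(ratio, 4))
```

The printed ratio is the same for every $y_1$, which is the numerical counterpart of the statement that the standard deviation of an array is independent of the array chosen.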

By putting z = some constant, k, and arguing just as we did in the
case of two independent variables, we find that all values of x and y
which occur together with the same frequency define points (x, y)
which lie on the ellipse

$$z = k, \qquad \frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} - \frac{2rxy}{\sigma_x\sigma_y} = c.$$

The different ellipses which can be obtained by varying the
frequency, and consequently varying c, are concentric, similar, and
similarly situated, if they are orthogonally projected on to the
plane z = 0. The planes giving the means of the x's, or the most
frequent x's, corresponding to particular values of y, and the means
of the y's, or the most frequent y's, corresponding to particular
values of x, meet z = 0 in the lines of regression

$$\frac{x}{\sigma_x} = r\frac{y}{\sigma_y}, \qquad \frac{y}{\sigma_y} = r\frac{x}{\sigma_x}.$$

If we alter the scales of x and y by writing $x/\sigma_x = x'$ and $y/\sigma_y = y'$
so that unit change in each shall be of the same magnitude, the
frequency surface takes the form

$$z = \mu\, e^{-\frac{1}{2(1 - r^2)}\left(x'^2 + y'^2 - 2rx'y'\right)}.$$

When $y' = 0$, $z = \mu\, e^{-x'^2/2(1 - r^2)}$, a normal distribution, the mean being
on the plane $x' = ry'$, and the standard deviation being $\sqrt{1 - r^2}$.
Similarly for $x' = 0$. When $y' = y'_1$, $z = \mu\, e^{-y_1'^2/2}\, e^{-(x' - ry'_1)^2/2(1 - r^2)}$, a
normal distribution, the mean being on the plane $x' = ry'$, and
the standard deviation being $\sqrt{1 - r^2}$ as before. Similarly for
$x' = x'_1$.

Again the ellipse which is the locus of the points $(x', y')$ obtained
by putting z = constant, k, corresponding to variables which occur
with the same frequency, is (in the plane z = k) now

$$x'^2 + y'^2 - 2rx'y' = c,$$

and, projecting on to the plane z = 0, the lines of regression are

$$x' = ry', \qquad y' = rx'.$$

These lines are the intersections with z = 0 of the planes containing
the means of the $x'$'s, or the most frequent $x'$'s, corresponding to
particular $y'$'s, and vice versa.

Since, geometrically, the transformation $x/\sigma_x = x'$, $y/\sigma_y = y'$, is
equivalent to an orthogonal projection, we may learn something about
the more general ellipse by considering properties of the simpler
projected curve which are not changed by projection.

Let us first, however, find the magnitude and direction of the
axes of

$$x'^2 + y'^2 - 2rx'y' = c.$$

By turning the axes through some angle $\theta$ this equation is
reducible to the form

$$\frac{x^2}{a^2} + \frac{y^2}{b^2} = 1,$$

which is the ordinary form for an ellipse when its axes lie along
the axes of co-ordinates. But the equation in $x'$, $y'$ is clearly
symmetrical about the lines $y' = x'$ and $y' = -x'$ because $y'$ and $x'$
or $y'$ and $-x'$ can be interchanged without the equation being
affected. Hence these lines must give the directions of the major
and minor axes.

To turn the axes of co-ordinates through an angle of 45°, fig.
(52), we must write

$$x' = x''\cos 45° - y''\sin 45° = \frac{x'' - y''}{\sqrt{2}},$$

$$y' = x''\sin 45° + y''\cos 45° = \frac{x'' + y''}{\sqrt{2}}.$$

Fig. (52).

The equation of the ellipse thus becomes

$$\frac{(x'' - y'')^2}{2} + \frac{(x'' + y'')^2}{2} - 2r\,\frac{(x'' - y'')(x'' + y'')}{2} = c,$$

i.e. $x''^2 + y''^2 - r(x''^2 - y''^2) = c,$

i.e. $x''^2(1 - r) + y''^2(1 + r) = c,$

i.e.

$$\frac{x''^2}{c/(1 - r)} + \frac{y''^2}{c/(1 + r)} = 1.$$

Hence the semi-major axis is $a = \sqrt{\dfrac{c}{1 - r}}$ and the semi-minor axis
$b = \sqrt{\dfrac{c}{1 + r}}$.
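
The lengths of the semi-axes admit of a simple check. In the sketch below, which is an illustration and not part of the text, the function names and the values of c and r are assumptions. The end of the axis lying along $y' = x'$ is at $(a/\sqrt{2},\ a/\sqrt{2})$ and the end of the axis lying along $y' = -x'$ is at $(-b/\sqrt{2},\ b/\sqrt{2})$; both points should satisfy $x'^2 + y'^2 - 2rx'y' = c$.

```python
import math

def semi_axes(c, r):
    """Semi-axes a, b of the ellipse x'^2 + y'^2 - 2 r x' y' = c (for -1 < r < 1)."""
    return math.sqrt(c / (1 - r)), math.sqrt(c / (1 + r))

def on_ellipse(x, y, c, r):
    """True if (x, y) satisfies x^2 + y^2 - 2 r x y = c to within rounding."""
    return math.isclose(x * x + y * y - 2 * r * x * y, c, rel_tol=1e-9)

if __name__ == "__main__":
    c = 2.0                                   # hypothetical constant
    for r in (0.0, 0.5, 0.9, -0.5):
        a, b = semi_axes(c, r)
        end_a = (a / math.sqrt(2), a / math.sqrt(2))     # along y' = x'
        end_b = (-b / math.sqrt(2), b / math.sqrt(2))    # along y' = -x'
        print(r, round(a, 4), round(b, 4),
              on_ellipse(*end_a, c, r), on_ellipse(*end_b, c, r))
```

For a negative r the quantity a is the smaller of the two, so that the names major and minor are then interchanged.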


We note that as r increases from 0 to 1, a increases 


from VctooD.whileft decreases from Veto Also, 

from Oto —1, a decreases from Vc to while 6 i 

Veto 00. 


as f decreases 


increases from 
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The ellipaes, x'^-\-y'^—2rx'y' ^c, corresponding to different values 
of r all pass through the points of intersection of 

x'^-\-y'-=c and x'y'=0. 

But a:'2+y'2=cis what the equation of the ellipse becomes when r, 
the cocfiBcient of correlation, vanishes. The connection between 
these curves is shown in fig. (53), which represents their projection 
on to the plane «=0. A positive correlation between x and y 
might be expected to increase the y corresponding to a particular 
positive a:, if the frequency be fixed beforehand, and that is the 
effect which the figure also would suggest. 



Fro. (B3). 

Now, in a:'2-fy'2—2rx'y'=o, 

the lines of regression are 

y’ =ra:', y' =ic', 

f 

and the axes of the ellipse are 

y'=x', y' = — 

Hence the lines of regression are equally inclined to the axes of the 
ellipse as well as to the axes of co>ordinates, fig. (54). 

Further, the pair of lines 

y'=x', y'=—»' 

form a harmonic pencil with the pair 

*'=0, y'-O, 

and also with the pair 

f 

This is obvious from fig. (54). 
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Now project back to the ellipse 

±+t—1r^ 


=constant. 


The algebraical transformation for this is merely 




Since the harmonic 
have the pair of lines 


property is unaltered by projection we then 


Oy 0*2 Oy 

harmonic with the pair 

x=0, y=0, 

and also with the pair 

y _x y 1 X 

——- « -- 

Og (Ty T (Tg 

Hence the two lines of regression corresponding to maximum 
correlation (r=+l and r=—1) are harmonic with 

(1) the axes of co-ordinates; 

(2) the lines of regression for any r* 

Again it may be easily seen that the lines

$y'=rx'$ and $x'=0$

are conjugate diameters of the ellipse

$x'^2+y'^2-2rx'y'=c$, . . . (5)

for they may be written as one equation thus:

$rx'^2-x'y'=0$,

and this represents a pair of lines harmonic with the (imaginary)
asymptotes of (5), namely, with

$x'^2+y'^2-2rx'y'=0$.

[The criterion for $ax^2+2hxy+by^2=0$

to be harmonic with $a'x^2+2h'xy+b'y^2=0$

is $ab'+ba'=2hh'$.]

But it is a well-known property of conics that any pair of lines
harmonic with the asymptotes are conjugate diameters of the
conic.

Similarly it may be shown that the lines

$y'=x'/r$ and $y'=0$

are conjugate diameters of the ellipse (5).

But, on projection, the conjugate property also is unaltered. 

Hence the lines $\frac{y}{\sigma_y}=r\frac{x}{\sigma_x}$, $x=0$,

and the lines $\frac{y}{\sigma_y}=\frac{1}{r}\,\frac{x}{\sigma_x}$, $y=0$

are conjugate pairs of diameters of the ellipse

$\frac{x^2}{\sigma_x^2}+\frac{y^2}{\sigma_y^2}-\frac{2rxy}{\sigma_x\sigma_y}=c.$

But for conjugate diameters the midpoints of all chords parallel
to either lie on the other.

Thus we come back again by another route to the familiar line of
regression theorems that, for a given $r$, all arrays parallel to $x=0$
have their means on $\frac{y}{\sigma_y}=r\frac{x}{\sigma_x}$, and all arrays parallel to $y=0$ have
their means on $\frac{x}{\sigma_x}=r\frac{y}{\sigma_y}$.
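The reader who prefers an arithmetical check to a geometrical argument may find the following sketch of use. It cuts the ellipse by lines parallel to $x=0$, finds the midpoint of each chord, and compares it with the regression line $y/\sigma_y=r\,x/\sigma_x$. The function name and the numerical values of $\sigma_x$, $\sigma_y$, $r$ and $c$ are illustrative assumptions only and do not come from the text.

```python
import math

def chord_midpoint_y(x, sx, sy, r, c):
    """y value of the midpoint of the chord cut by the vertical line at x from
    the ellipse x^2/sx^2 + y^2/sy^2 - 2*r*x*y/(sx*sy) = c (None if no chord)."""
    # For a fixed x the ellipse is a quadratic A*y^2 + B*y + C = 0 in y.
    A = 1.0 / sy**2
    B = -2.0 * r * x / (sx * sy)
    C = x**2 / sx**2 - c
    disc = B * B - 4.0 * A * C
    if disc < 0:
        return None  # the line misses the ellipse
    root = math.sqrt(disc)
    y1 = (-B + root) / (2.0 * A)
    y2 = (-B - root) / (2.0 * A)
    return 0.5 * (y1 + y2)

# Illustrative values only (assumed, not taken from the text).
sx, sy, r, c = 2.0, 3.0, 0.6, 1.0
for x in (-1.0, 0.5, 1.5):
    mid = chord_midpoint_y(x, sx, sy, r, c)
    on_line = r * (sy / sx) * x  # regression line y/sy = r * x/sx
    print(f"x={x:5.2f}  chord midpoint={mid:8.5f}  regression line={on_line:8.5f}")
```

The two printed columns should agree to within rounding, whatever admissible values are chosen, since the sum of the roots of the quadratic in $y$ is $2r\sigma_y x/\sigma_x$.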



APPENDIX 


1. Compound Interest Law. If the capital increases continuously,
instead of going up by jumps at the end of stated periods, the
connection between the original principal $S_0$, the rate per cent. per
annum $r$, and the amount $S_t$ at the end of $t$ years is given by

$S_t=S_0e^{rt/100}$,

for the rate of increase is measured by

$\frac{dS}{dt}=\frac{rS}{100}$,

which leads at once to the above equation on integrating.

Other instances of the same law are:

(1) A particle moving against a resistance proportional to its
velocity,

$v_t=v_0e^{-ct}$,

where $v_t$ is the velocity at time $t$, $v_0$ is the original velocity, and $c$ is
some constant.

(2) The variation of the pressure of the atmosphere with height,

$p_h=p_0e^{-ch}$,

where $p_h$ is the pressure at height $h$ above a surface level, $p_0$ is the
pressure at the surface, and $c$ is some constant.

(3) The rate of cooling,

$\theta_t=\theta_0e^{-ct}$,

where $\theta_t$ is the excess of temperature at time $t$ of the hot body
over that of surrounding bodies, $\theta_0$ is the excess when the
measurement begins, and $c$ is some constant.
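A short numerical sketch may make the passage from periodic to continuous growth more concrete. It compares the amount obtained when interest is added $k$ times a year with the continuous law $S_t=S_0e^{rt/100}$. The function names and the figures used (a principal of 100 at 5 per cent. for 10 years) are assumptions chosen for illustration.

```python
import math

def amount_continuous(s0, rate_pct, years):
    """Amount under continuous growth: S_t = S_0 * exp(r*t/100)."""
    return s0 * math.exp(rate_pct * years / 100.0)

def amount_periodic(s0, rate_pct, years, periods_per_year):
    """Amount when interest is added at the end of each of k periods a year."""
    k = periods_per_year
    return s0 * (1.0 + rate_pct / (100.0 * k)) ** (k * years)

s0, rate_pct, years = 100.0, 5.0, 10  # illustrative figures only
for k in (1, 4, 12, 365):
    print(k, round(amount_periodic(s0, rate_pct, years, k), 4))
print("continuous", round(amount_continuous(s0, rate_pct, years), 4))
```

As the number of periods in the year is increased, the periodic amount rises towards the continuous one without exceeding it.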

2. Weighted Mean. Let the observations be represented by the
different values, $x_1, x_2, \ldots x_n$, of the variable concerned, and let
the respective weights attached to these observations be $f_1, f_2, \ldots f_n$,
so that the average, by definition,

$=\frac{f_1x_1+f_2x_2+\ldots+f_nx_n}{f_1+f_2+\ldots+f_n}.$

Now, suppose a different set of weights be chosen, namely,
$f'_1, f'_2, \ldots f'_n$, giving a new average

$\frac{f'_1x_1+f'_2x_2+\ldots+f'_nx_n}{f'_1+f'_2+\ldots+f'_n}.$

The difference between these two expressions

$=\frac{f_1x_1+f_2x_2+\ldots}{f_1+f_2+\ldots}-\frac{f'_1x_1+f'_2x_2+\ldots}{f'_1+f'_2+\ldots}$

$=\frac{(f'_1+f'_2+\ldots)(f_1x_1+f_2x_2+\ldots)-(f_1+f_2+\ldots)(f'_1x_1+f'_2x_2+\ldots)}{(f_1+f_2+\ldots)(f'_1+f'_2+\ldots)}$

$=\frac{(f_1f'_2-f_2f'_1)(x_1-x_2)+(f_1f'_3-f_3f'_1)(x_1-x_3)+\ldots}{(f_1+f_2+\ldots)(f'_1+f'_2+\ldots)}$

$=\frac{f'_1f'_2\left(\frac{f_1}{f'_1}-\frac{f_2}{f'_2}\right)(x_1-x_2)+\ldots}{(f_1+f_2+\ldots)(f'_1+f'_2+\ldots)}.$

Hence this difference is very small and the averages are very
nearly equal if the weights $f_1, f_2, f_3 \ldots$ are replaced by others
$f'_1, f'_2, f'_3 \ldots$ very nearly proportional to them, so that $f_1/f'_1$,
$f_2/f'_2 \ldots$ are not far from equality, and this is the more
pronounced if the observations $x_1, x_2, \ldots$ themselves are all
of the same order of magnitude and the sums of their weights,
$\Sigma f$ and $\Sigma f'$, are large so that the expressions of type
$(x_1-x_2)/(\Sigma f)(\Sigma f')$ are small.
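The argument may be tried on a small invented example. The sketch below computes the weighted mean with one set of weights, with a second set very nearly proportional to the first, and with a third set quite unlike it. It also checks the pairwise form of the difference derived above. All names and figures are assumptions made for illustration.

```python
def weighted_mean(xs, ws):
    """Weighted mean: sum(f*x) / sum(f)."""
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)

def difference_by_pairs(xs, f, g):
    """(mean with weights f) - (mean with weights g), written as the sum over
    pairs i < j of (f_i*g_j - f_j*g_i)*(x_i - x_j), divided by sum(f)*sum(g)."""
    n = len(xs)
    num = sum((f[i] * g[j] - f[j] * g[i]) * (xs[i] - xs[j])
              for i in range(n) for j in range(i + 1, n))
    return num / (sum(f) * sum(g))

xs = [10.0, 12.0, 15.0, 11.0]      # invented observations
f = [3.0, 5.0, 2.0, 4.0]           # original weights
g_near = [6.1, 9.9, 4.05, 8.0]     # nearly proportional to f
g_far = [1.0, 1.0, 10.0, 1.0]      # very different from f

print(weighted_mean(xs, f), weighted_mean(xs, g_near), weighted_mean(xs, g_far))
print(weighted_mean(xs, f) - weighted_mean(xs, g_far),
      difference_by_pairs(xs, f, g_far))
```

The first two means should differ only slightly, while the third departs widely from them; the last line prints the difference computed in two ways, which should agree.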


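3. Geometric and Harmonic Means. Given $n$ numbers

$a, b, c \ldots$

their geometric mean, $g$, is defined by the formula

$g=\sqrt[n]{(abc\ldots)}$,

and their harmonic mean, $h$, is defined by

$\frac{1}{h}=\frac{1}{n}\left(\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\ldots\right).$

We note that when

$a=b=c=\ldots=k$, say,

then

$g=\sqrt[n]{(kkk\ldots)}=\sqrt[n]{(k^n)}=k$,

and

$\frac{1}{h}=\frac{1}{n}\cdot\frac{n}{k}=\frac{1}{k}$,

so that

$h=k$.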
It is worthy of remark that if the geometric mean be adopted as
average in discussing the index numbers of prices it possesses an
interesting property which does not hold for any of the other means
in common use.

Suppose the prices of $n$ standard commodities at three successive
dates be represented by $(a_1, b_1, c_1 \ldots)$, $(a_2, b_2, c_2 \ldots)$, $(a_3, b_3, c_3 \ldots)$.

Then the index numbers of the separate commodity prices at the
third date, taking the prices at the first date as standard, are

$100\frac{a_3}{a_1},\ 100\frac{b_3}{b_1},\ 100\frac{c_3}{c_1}\ \ldots$

Hence the geometric mean of these $n$ index numbers together

$=\sqrt[n]{\left(100\frac{a_3}{a_1}\times100\frac{b_3}{b_1}\times100\frac{c_3}{c_1}\times\ldots\right)}$

$=100\sqrt[n]{(a_3b_3c_3\ldots)}\big/\sqrt[n]{(a_1b_1c_1\ldots)}$

$=100\,g_3/g_1$,

where $g_1$, $g_3$ denote the geometric means of the $n$ prices at the two
dates.

It follows that the ratio

(index number of prices at 3rd date with prices at 1st date as standard)
/ (index number of prices at 2nd date with prices at 1st date as standard)

$=\frac{100\,g_3/g_1}{100\,g_2/g_1}$

$=g_3/g_2$.

It is therefore quite independent of the particular date chosen as
standard.
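The property is easily tested on invented prices. In the sketch below the ratio of the index at the third date to that at the second is formed twice, once with the first date as standard and once with the second date as standard, using first the geometric and then the arithmetic mean. The function names and the price lists are assumptions for illustration only.

```python
import math

def geometric_mean(values):
    """n-th root of the product, computed through logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def arithmetic_mean(values):
    return sum(values) / len(values)

def index_number(prices, standard, mean):
    """Mean of the separate index numbers 100 * price / standard price."""
    return mean([100.0 * p / s for p, s in zip(prices, standard)])

def ratio_third_to_second(p2, p3, standard, mean):
    return index_number(p3, standard, mean) / index_number(p2, standard, mean)

# Invented prices of three commodities at three successive dates.
p1 = [2.0, 5.0, 10.0]
p2 = [3.0, 5.0, 8.0]
p3 = [4.0, 6.0, 9.0]

for mean in (geometric_mean, arithmetic_mean):
    print(mean.__name__,
          ratio_third_to_second(p2, p3, p1, mean),   # 1st date as standard
          ratio_third_to_second(p2, p3, p2, mean))   # 2nd date as standard
```

With the geometric mean the two ratios should coincide, both being $g_3/g_2$; with the arithmetic mean they will in general differ.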

4. The Mean of Combined Sets of Observations. (1) Suppose one
variable $x$ is expressed as the sum of a number of other variables,

thus $x=a+b+c+\ldots$

and suppose that we have $n$ different values of the variables, giving
equations of the type

$x_1=a_1+b_1+c_1+\ldots$

$x_2=a_2+b_2+c_2+\ldots$

. . .

$x_n=a_n+b_n+c_n+\ldots$

Hence, by addition,

$x_1+x_2+\ldots+x_n=(a_1+\ldots+a_n)+(b_1+\ldots+b_n)+(c_1+\ldots+c_n)+\ldots$

so that $n\bar{x}=n\bar{a}+n\bar{b}+n\bar{c}+\ldots$

$\bar{x}=\bar{a}+\bar{b}+\bar{c}+\ldots$

where $\bar{x}, \bar{a}, \bar{b} \ldots$ denote the means of the $n$ values of the
respective variables.

Thus the mean of a sum equals the sum of the means, and, if some
of the positive signs in $(a+b+c+\ldots)$ are made negative, there
will evidently be a corresponding change of sign in $(\bar{a}+\bar{b}+\bar{c}+\ldots)$.

Example. Suppose 100 family budgets are collected and the
items in each are separated under five heads: rent, food, clothes,
coals and light, sundries. The expenditure, $x$, in each budget would
thus be expressed as the sum of five variables, $a, b, c, d, e$, and the
mean of the 100 different $x$'s would equal the sum of the means of
the $a$'s, the $b$'s, the $c$'s, the $d$'s, and the $e$'s.

(2) Sets of observations are made which differ in locality or time or
some other respect. To find the resultant mean.

Let $l$ observations of the variable $x$ refer, say, to one date,
$m$ observations to a second date, $n$ observations to a third date,
and so on, and let the means of these successive groups of
observations be $\bar{x}_l, \bar{x}_m, \bar{x}_n, \ldots$, so that we may write

$\bar{x}_l=\Sigma x_l/l,\quad \bar{x}_m=\Sigma x_m/m,\quad \bar{x}_n=\Sigma x_n/n,\ \ldots$

If then $\bar{x}$ be the resultant mean, we have

$\bar{x}=\frac{\Sigma x_l+\Sigma x_m+\Sigma x_n+\ldots}{l+m+n+\ldots}=\frac{l\bar{x}_l+m\bar{x}_m+n\bar{x}_n+\ldots}{l+m+n+\ldots}.$

Example. If the school children in the different schools of a
county are weighed, $l$ children in one school, $m$ in another, $n$ in
another, and so on, giving mean weights $\bar{x}_l, \bar{x}_m, \bar{x}_n, \ldots$, the
resultant mean weight for the children in all the schools combined
is then given by the above expression.
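Both results of this note can be verified on a handful of invented figures. The sketch below first confirms that the mean of a sum equals the sum of the means, and then that the resultant mean of several groups equals the mean of their separate means weighted by the numbers in the groups. The budgets and weights shown are assumptions for illustration.

```python
def mean(values):
    return sum(values) / len(values)

def combined_mean(sizes, means):
    """Resultant mean (l*mean_l + m*mean_m + ...) / (l + m + ...)."""
    return sum(n * m for n, m in zip(sizes, means)) / sum(sizes)

# (1) Invented budgets, each split under three heads (rent, food, sundries).
rent = [8.0, 10.0, 7.0, 9.0]
food = [20.0, 25.0, 18.0, 22.0]
sundries = [5.0, 4.0, 6.0, 3.0]
totals = [a + b + c for a, b, c in zip(rent, food, sundries)]
print(mean(totals), mean(rent) + mean(food) + mean(sundries))

# (2) Invented weights of children in three schools of unequal size.
schools = [[30.0, 32.0, 31.0], [28.0, 35.0], [33.0, 30.0, 29.0, 32.0]]
sizes = [len(s) for s in schools]
means = [mean(s) for s in schools]
everyone = [w for s in schools for w in s]
print(combined_mean(sizes, means), mean(everyone))
```

Each printed pair should agree. A plain unweighted average of the three school means would not in general give the resultant mean, because the schools differ in size.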


5. Mean and Standard Deviation of a Distribution of Variables.
Let $x_1, x_2, \ldots, x_n$ denote the deviations of each value, or group
mid-value, of the observed organ or character when measured from
some fixed value, and let $f_1, f_2, f_3 \ldots f_n$ denote the observed
frequencies of these respective deviations.

The arithmetic mean of the variables is thus given by

$\bar{x}=(f_1x_1+f_2x_2+\ldots+f_nx_n)/(f_1+f_2+\ldots+f_n)$

referred to the fixed value as origin.

We may conveniently represent the deviations $x_1, x_2, x_3 \ldots$ by
lengths measured from an arbitrary origin O along a straight line,
in which case the point O defines the position of the fixed value
from which the variables are measured.

Let P mark the position corresponding to a typical variable and
let G mark the position corresponding to the mean, $\bar{x}$. Thus
OP $=x$, OG $=\bar{x}$, and if we denote the distance of P from G by $\xi$
we have

$x=\bar{x}+\xi$.

Hence

$\bar{x}=(f_1x_1+f_2x_2+\ldots+f_nx_n)/(f_1+f_2+\ldots+f_n)$

$=[f_1(\bar{x}+\xi_1)+f_2(\bar{x}+\xi_2)+\ldots+f_n(\bar{x}+\xi_n)]/(f_1+f_2+\ldots+f_n)$

$=[\bar{x}(f_1+f_2+\ldots+f_n)+(f_1\xi_1+f_2\xi_2+\ldots+f_n\xi_n)]/(f_1+f_2+\ldots+f_n)$

$=\bar{x}+(f_1\xi_1+f_2\xi_2+\ldots+f_n\xi_n)/(f_1+f_2+\ldots+f_n)$.

Therefore $f_1\xi_1+f_2\xi_2+\ldots+f_n\xi_n=0$.

The expression $(f_1x_1+f_2x_2+\ldots+f_nx_n)$ is called the first
moment of the distribution referred to O as origin. We conclude that
when the distribution is referred to G as origin, i.e. when deviations
are measured from the mean of the distribution, the first moment vanishes.


Frequency Distribution Table.

   (1)                    (2)               (3)                   (4)
   Deviations of          Frequency of      Product of Nos.       Product of Nos.
   Variables from some    Deviations.       in Col. (1) and       in Col. (1) and
   fixed value.                             Col. (2).             Col. (3).

   x_1                    f_1               f_1 x_1               f_1 x_1^2
   x_2                    f_2               f_2 x_2               f_2 x_2^2
   x_3                    f_3               f_3 x_3               f_3 x_3^2
   . .                    . .               . .                   . .
   x_n                    f_n               f_n x_n               f_n x_n^2

                          N                 N'_1                  N'_2

In the notation of the above table, where the dashes are omitted
in $N_1$, $N_2$ when the mean is origin, we have

$\bar{x}=N'_1/N$ and $N_1=0$.

Again, the root-mean-square deviation, $s$, measured from the
arbitrary origin O, is given by

$s^2=(f_1x_1^2+f_2x_2^2+\ldots+f_nx_n^2)/(f_1+f_2+\ldots+f_n)$

$=N'_2/N$,

and $N'_2$ is called the second moment of the distribution referred to O
as origin.

Substituting as before we have

$s^2=[f_1(\bar{x}+\xi_1)^2+\ldots+f_n(\bar{x}+\xi_n)^2]/(f_1+\ldots+f_n)$

$=\frac{\bar{x}^2(f_1+\ldots+f_n)+2\bar{x}(f_1\xi_1+\ldots+f_n\xi_n)+(f_1\xi_1^2+\ldots+f_n\xi_n^2)}{(f_1+\ldots+f_n)}$

$=\bar{x}^2+(f_1\xi_1^2+\ldots+f_n\xi_n^2)/(f_1+\ldots+f_n)$,

since $f_1\xi_1+\ldots+f_n\xi_n=0$.

Hence $s^2=\bar{x}^2+\sigma^2$, . . . (2)

where $\sigma$ is the root-mean-square deviation measured from G as
origin, or the standard deviation as it is called.

From this result it is clear that $\sigma$ is always less than $s$, or the
root-mean-square deviation is least when measured from the arithmetic
mean.

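Generally, if we write

$\nu'_k=\Sigma(fx^k)/\Sigma f=(f_1x_1^k+\ldots+f_nx_n^k)/(f_1+\ldots+f_n)$,

$\nu_k=\Sigma(f\xi^k)/\Sigma f=(f_1\xi_1^k+\ldots+f_n\xi_n^k)/(f_1+\ldots+f_n)$,

where $\Sigma(fx^k)$ and $\Sigma(f\xi^k)$ may be called the $k$th moments referred to O
and to the mean as origins respectively, so that $\nu_1=0$,
$\nu'_2=s^2$, we have

$\nu'_k=[f_1(\xi_1+\bar{x})^k+\ldots+f_n(\xi_n+\bar{x})^k]/(f_1+\ldots+f_n)$

$=\nu_k+k\bar{x}\,\nu_{k-1}+\frac{k(k-1)}{1\cdot2}\bar{x}^2\nu_{k-2}+\ldots+\bar{x}^k.$

For example, when $k=2$, since $\nu_0=1$, $\nu_1=0$,

$\nu_2=\nu'_2-\bar{x}^2$. . . . (2) bis

Again, when $k=3$, $\nu_3=\nu'_3-3\bar{x}\nu_2-\bar{x}^3$, . . . (3)

and, when $k=4$, $\nu_4=\nu'_4-4\bar{x}\nu_3-6\bar{x}^2\nu_2-\bar{x}^4$. . . . (4)

There are interesting statical analogues to the above results
concerning the mean and standard deviation.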
Let us imagine a set of weights, $f_1, f_2, f_3 \ldots$ suspended at
$P_1, P_2, P_3 \ldots$ from a straight horizontal bar, and let the distance
of any typical weight $f$ from some arbitrary origin O on the bar be $x$.
Then the first moment,

$(f_1x_1+f_2x_2+\ldots+f_nx_n)$

(where some of the $x$'s may be negative corresponding to weights
suspended to the left of O) measures the total turning effect of all
the given weights about O, and if we further imagine all these
weights replaced by a single weight equal to their sum
$(f_1+f_2+\ldots+f_n)$, then, in order to produce the same turning
effect, it would have to be placed at a point G, the distance of which
from O is given by

$\bar{x}(f_1+f_2+\ldots+f_n)=(f_1x_1+f_2x_2+\ldots+f_nx_n).$

Thus $\bar{x}=(f_1x_1+f_2x_2+\ldots+f_nx_n)/(f_1+f_2+\ldots+f_n)$,

and, statically, this defines the position of the centre of gravity of
the given weights, $f_1, f_2, \ldots f_n$, relative to O.

As before, $\bar{x}=\Sigma f(\bar{x}+\xi)/\Sigma f$

$=\bar{x}+\Sigma(f\xi)/\Sigma f$;

hence $f_1\xi_1+f_2\xi_2+\ldots+f_n\xi_n=0$,

and, statically, this means that the turning effect of $f_1, f_2 \ldots f_n$
about G is zero, in other words, the bar would balance freely about G.
Again, the second moment,

$f_1x_1^2+f_2x_2^2+\ldots+f_nx_n^2$,

measures the moment of inertia of the weights $f_1, f_2 \ldots f_n$ about O,
and, if we imagine these different weights replaced by a single
weight $(f_1+f_2+\ldots+f_n)$ as before, the moment of inertia will
be unaltered if the latter be located at a distance $s$ from O, where

$(f_1+f_2+\ldots+f_n)s^2=(f_1x_1^2+f_2x_2^2+\ldots+f_nx_n^2)$;

therefore $s^2=(f_1x_1^2+\ldots+f_nx_n^2)/(f_1+\ldots+f_n)=\bar{x}^2+\sigma^2$

as before, and the interpretation of this is that the square of the
radius of gyration of the system of weights about O equals the
square of the radius of gyration about G, the centre of gravity of
the system, together with the square of the distance of G from O.
Also, $s$ is clearly least when it is measured from G.
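The relations of this note lend themselves to a direct arithmetical check. The sketch below takes an invented frequency distribution, computes its moments about an arbitrary origin and about the mean, and verifies that the first moment about the mean vanishes, that $s^2=\bar{x}^2+\sigma^2$, and that the third and fourth moments transfer as the binomial expansion requires. The function name and the figures are assumptions for illustration.

```python
def moment(devs, freqs, k, origin=0.0):
    """k-th moment per unit frequency about the given origin:
    sum(f * (x - origin)**k) / sum(f)."""
    return sum(f * (x - origin) ** k for x, f in zip(devs, freqs)) / sum(freqs)

# Invented deviations from a fixed value, with their frequencies.
devs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
freqs = [1, 4, 7, 5, 2, 1]

x_bar = moment(devs, freqs, 1)                 # mean, referred to O
s_sq = moment(devs, freqs, 2)                  # second moment about O
sigma_sq = moment(devs, freqs, 2, x_bar)       # second moment about G
nu3 = moment(devs, freqs, 3, x_bar)
nu4 = moment(devs, freqs, 4, x_bar)

print("first moment about mean:", moment(devs, freqs, 1, x_bar))
print("s^2 =", s_sq, " x_bar^2 + sigma^2 =", x_bar**2 + sigma_sq)
# Transfer of the third and fourth moments from O to the mean.
print(nu3, moment(devs, freqs, 3) - 3 * x_bar * sigma_sq - x_bar**3)
print(nu4, moment(devs, freqs, 4) - 4 * x_bar * nu3
      - 6 * x_bar**2 * sigma_sq - x_bar**4)
```

In the statical language of the text, `x_bar` locates the centre of gravity, `s_sq` is the square of the radius of gyration about O, and `sigma_sq` that about G.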
6. The Mean Deviation a Minimum when measured from the
Median. Consider first the case when only two different values of
the variable are observed, $X_1$, $X_2$, and let their deviations from an
arbitrary value, O, chosen as origin, be respectively $x_1$, $x_2$.

If $f_1$, $f_2$ be the observed frequencies of these values, the sum of
their deviations from O is

$(f_1x_1+f_2x_2)$,

which is clearly less when the value O lies between $X_1$, $X_2$
than when it is smaller or greater than both of them.

Choosing O, therefore, between $X_1$, $X_2$, if $f_1$ be the
greater frequency we write the deviation sum

$=f_2x+(f_1-f_2)x_1$,

where $x$ is the deviation of either of the values $X_1$, $X_2$ from the
other, and $(f_1-f_2)$ is positive since $f_1>f_2$.

Now this is evidently least when $(f_1-f_2)x_1$ vanishes, i.e. when
(1) $x_1=0$, in which case O coincides with $X_1$, the more frequent of
the two variables, or, when (2) $f_1=f_2$, and in this case, when the
two observed values occur equally often, the deviation sum is
constant for any origin between $X_1$ and $X_2$.

When several different values of the variable are observed, they
may be arranged in order of magnitude, $X_1, X_2, X_3 \ldots X_n$, from
the least to the greatest, with frequencies $f_1, f_2, f_3 \ldots f_n$.

If $f_1>f_n$ we pair off $f_n$ of the $X_n$'s with $f_n$ of the $X_1$'s; the
deviation sum for this pair is least and remains constant when measured
from any origin between $X_1$ and $X_n$. We next pair off some or all
of the $X_1$'s which remain against an equal number of $X_{n-1}$'s and
the deviation sum for this pair is least and remains constant when
measured from any origin between $X_1$ and $X_{n-1}$. If some $X_1$'s
still remain, we pair them off so far as we can against an equal
number of $X_{n-2}$'s but, if it be $X_{n-1}$'s that remain, we pair them
off against an equal number of $X_2$'s.

This process can evidently be continued until ultimately we
reach the origin from which the mean deviation of the whole
distribution is a minimum, for if any X be left unpaired the origin
will coincide with that X. Otherwise, the deviation is least when
measured from any value between the last two X's paired off
together, and within that range it is constant.

Since, by definition, the median is the value of the variable
half-way along the series of given observations, ranged in order of their
magnitude and assigning each its due weight or frequency, it is
clearly such that a balance can be effected by pairing off the values
on either side of it against one another in the manner explained
above; it therefore follows that the mean deviation of a frequency
distribution is a minimum when the deviations are measured from
the median.
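The pairing argument can be tested by brute force. The sketch below computes the mean deviation of an invented frequency distribution from a fine grid of trial origins and confirms that none gives a smaller value than the median; it also shows the two-value case with equal frequencies, where the deviation is constant between the two values. The function names and figures are assumptions, and the median is taken, for simplicity, as the first value at which the cumulative frequency reaches half the total.

```python
def mean_deviation(values, freqs, origin):
    """Mean of the deviations from `origin`, taken without regard to sign."""
    return sum(f * abs(x - origin) for x, f in zip(values, freqs)) / sum(freqs)

def weighted_median(values, freqs):
    """First value (in ascending order) at which the cumulative frequency
    reaches half the total frequency."""
    pairs = sorted(zip(values, freqs))
    total = sum(freqs)
    running = 0
    for x, f in pairs:
        running += f
        if 2 * running >= total:
            return x

values = [1.0, 2.0, 4.0, 7.0, 11.0]   # invented observations
freqs = [3, 1, 4, 2, 1]

med = weighted_median(values, freqs)
mean = sum(f * x for x, f in zip(values, freqs)) / sum(freqs)
trial_origins = [i / 10.0 for i in range(0, 121)]
best = min(mean_deviation(values, freqs, o) for o in trial_origins)

print("median:", med, " mean deviation from it:", mean_deviation(values, freqs, med))
print("mean:", mean, " mean deviation from it:", mean_deviation(values, freqs, mean))
print("smallest value over all trial origins:", best)

# Two values of equal frequency: constant for any origin between them.
print(mean_deviation([2.0, 8.0], [5, 5], 3.0), mean_deviation([2.0, 8.0], [5, 5], 6.0))
```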

The statical analogy to the median also is worth noting. With
the same notation as before, the moment or turning effect of two
forces, $f_1$, $f_2$, about O is

$(f_1x_1+f_2x_2)$.

But in this case, if O be taken at some point in between $X_1$
and $X_2$, since the mean deviation sums the separate deviations
without regard to sign, we must imagine $f_2$ reversed
so as to produce a turning effect in the same direction as before.

The moment will then be still $(f_1x_1+f_2x_2)$, and it is less when O
occupies such a position than when it is on $X_1X_2$ produced in
either direction.

Taking O, therefore, somewhere in between $X_1$ and $X_2$, the moment
may be written

$=f_2(x_1+x_2)+x_1(f_1-f_2)$;

and, if $f_1>f_2$, this is least when $x_1$ vanishes, that is, when O coincides
with $X_1$, but if $f_1=f_2$, the two forces constitute a couple, and the
moment is the same whatever position O occupies between $X_1$
and $X_2$.


7. The Method of Least Squares. To the student who is
unacquainted with the differential calculus, the following descriptive
argument, the basis of the principle of least squares, for determining
the values of $m$ and $c$ which make

$(mx_1+c-y_1)^2+(mx_2+c-y_2)^2+\ldots+(mx_n+c-y_n)^2$ . . . (1)

a minimum, may prove instructive.

Let us call the above expression E and let us suppose that different
values are given to $m$ while $c$ remains unchanged; in that case E
will vary with $m$, and we might imagine the different values obtained
for E plotted against the corresponding values of $m$ giving a curve
of some type. Such a curve may rise and fall in wave-like fashion
as in the figure, resulting in maximum points like A and C, and
minimum points like B, where we define a maximum point to be
such that, as we move away from it along the curve, whether to
left or right, the size of the ordinate (and therefore the value of E)
decreases; likewise, a minimum point is such that, as we move
away from it, the ordinate (and therefore also E) increases. In
the neighbourhood of such points it is clear that the size of the
ordinate, such as A$a$ or B$b$, changes so slowly as to be
practically stationary.

Suppose then that $m$ and $(m+\mu)$, $\mu$ being very small,
are two values of $m$ respectively at and near a minimum
position on the curve, i.e. a position like B corresponding
to a minimum value for E. Since E near such a point
does not differ appreciably from E at such a point, we may
practically equate the two expressions obtained for E by substituting
$(m+\mu)$ and $m$ respectively for $m$ in (1), thus

$[(m+\mu)x_1+c-y_1]^2+[(m+\mu)x_2+c-y_2]^2+\ldots=(mx_1+c-y_1)^2+(mx_2+c-y_2)^2+\ldots$

$[(mx_1+c-y_1)+\mu x_1]^2+\ldots=(mx_1+c-y_1)^2+\ldots$

$[(mx_1+c-y_1)^2+2\mu x_1(mx_1+c-y_1)+\mu^2x_1^2]+\ldots=(mx_1+c-y_1)^2+\ldots$

Thus $[2x_1(mx_1+c-y_1)+\mu x_1^2]+\ldots=0$.

Now, the smaller we take $\mu$, the nearer to the truth does this
result become. Hence, by making $\mu$ tend to zero, we are led to
the strictly true relation

$x_1(mx_1+c-y_1)+\ldots=0$.

This is one of the equations in the text. To obtain the second,
we keep $m$ constant and vary $c$.

Suppose $c$ and $(c+\gamma)$ two values of $c$ at and near a minimum
position on the curve; then, equating the two corresponding
values of E, we have as before

$(mx_1+c+\gamma-y_1)^2+\ldots=(mx_1+c-y_1)^2+\ldots$

$(mx_1+c-y_1+\gamma)^2+\ldots=(mx_1+c-y_1)^2+\ldots$

$[(mx_1+c-y_1)^2+2\gamma(mx_1+c-y_1)+\gamma^2]+\ldots=(mx_1+c-y_1)^2+\ldots$

Thus $[2(mx_1+c-y_1)+\gamma]+\ldots=0$,

and, proceeding to the limit when $\gamma$ tends to zero, we reach the
other equation in the text, namely,

$(mx_1+c-y_1)+\ldots=0$.

[The Method of Least Squares came first into prominence in
Astronomy in connection with the determination of the best value
to take when a number of observations, apparently equally reliable,
give results not quite in agreement. If, for instance, $x$ be the true
value of some variable, and if $x_1, x_2, x_3 \ldots$ be the results of
$n$ observations, the method of least squares assumes $x$ to be given
by making

$y=(x-x_1)^2+(x-x_2)^2+\ldots+(x-x_n)^2$

a minimum.

Now $\frac{dy}{dx}=2(x-x_1)+2(x-x_2)+\ldots+2(x-x_n)$, and this vanishes

when $(x-x_1)+(x-x_2)+\ldots+(x-x_n)=0$,

i.e. $x=(x_1+x_2+\ldots+x_n)/n$,

so that in this case we are led to the ordinary arithmetic mean of
the $n$ observations as the best value.

The method was used by Gauss as early as 1795.]
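The two relations reached above, namely that $x_1(mx_1+c-y_1)+\ldots$ and $(mx_1+c-y_1)+\ldots$ both vanish, are a pair of simple equations in $m$ and $c$. The sketch below solves them for a few invented observations and then, in the spirit of the descriptive argument, checks that E increases when either $m$ or $c$ is moved a little from the value found. The function names and data are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Solve  sum x*(m*x + c - y) = 0  and  sum (m*x + c - y) = 0  for m, c."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    m = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    c = (sy - m * sx) / n
    return m, c

def sum_of_squares(m, c, xs, ys):
    """The expression E of the text."""
    return sum((m * x + c - y) ** 2 for x, y in zip(xs, ys))

xs = [0.0, 1.0, 2.0, 3.0, 4.0]          # invented observations
ys = [1.1, 2.9, 5.2, 7.1, 8.8]

m, c = fit_line(xs, ys)
e_min = sum_of_squares(m, c, xs, ys)
print("m =", m, " c =", c, " E =", e_min)

# A small change in m or in c, in either direction, should increase E.
for step in (-0.01, 0.01):
    print(sum_of_squares(m + step, c, xs, ys) - e_min,
          sum_of_squares(m, c + step, xs, ys) - e_min)
```

All four printed differences should be positive, and they shrink in proportion to the square of the step, which is the sense in which the ordinate is "practically stationary" near the minimum.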


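8. To prove $\int_{-\infty}^{+\infty}e^{-x^2}dx=\sqrt{\pi}$.

Let $I=\int_{-\infty}^{+\infty}e^{-x^2}dx$;

thus, also, $I=\int_{-\infty}^{+\infty}e^{-y^2}dy$;

therefore, $I^2=\int_{-\infty}^{+\infty}e^{-x^2}dx\int_{-\infty}^{+\infty}e^{-y^2}dy$

$=\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty}e^{-(x^2+y^2)}dx\,dy$

$=\int_0^{2\pi}\int_0^{\infty}e^{-r^2}r\,dr\,d\theta$

(by changing to polar co-ordinates)

$=\int_0^{2\pi}d\theta\left[-\tfrac{1}{2}e^{-r^2}\right]_0^{\infty}$

$=(\tfrac{1}{2})(2\pi)$.

Hence $I=\int_{-\infty}^{+\infty}e^{-x^2}dx=\sqrt{\pi}$.

9. To prove:

(1) $\Gamma(n+1)=n\Gamma(n)$. (2) $B(m, n)=\frac{\Gamma(m)\Gamma(n)}{\Gamma(m+n)}$.

(1) $\Gamma(n+1)=\int_0^{\infty}x^ne^{-x}dx$

$=\left[-x^ne^{-x}\right]_0^{\infty}+n\int_0^{\infty}x^{n-1}e^{-x}dx$

$=n\Gamma(n)$,

because the expression in square brackets vanishes at both limits.

(2) $\Gamma(m)\Gamma(n)=\int_0^{\infty}e^{-u}u^{m-1}du\int_0^{\infty}e^{-v}v^{n-1}dv$

$=\int_0^{\infty}e^{-x^2}x^{2m-2}\,2x\,dx\int_0^{\infty}e^{-y^2}y^{2n-2}\,2y\,dy$,

where $x^2=u$, $y^2=v$.

Hence $\Gamma(m)\Gamma(n)=4\int_0^{\infty}\int_0^{\infty}e^{-(x^2+y^2)}x^{2m-1}y^{2n-1}dx\,dy$

$=4\int_{r=0}^{\infty}\int_{\theta=0}^{\pi/2}e^{-r^2}r^{2m+2n-2}\cos^{2m-1}\theta\sin^{2n-1}\theta\;r\,dr\,d\theta$

(by changing to polar co-ordinates).

Thus $\Gamma(m)\Gamma(n)=\int_0^{\infty}e^{-r^2}r^{2m+2n-2}\,2r\,dr\cdot2\int_0^{\pi/2}\cos^{2m-1}\theta\sin^{2n-1}\theta\,d\theta$

$=\int_0^{\infty}e^{-p}p^{m+n-1}dp\int_0^1k^{n-1}(1-k)^{m-1}dk$,

where $p=r^2$ and $k=\sin^2\theta$;

therefore, $\Gamma(m)\Gamma(n)=\Gamma(m+n)B(n, m)$

$=\Gamma(m+n)B(m, n)$

by symmetry.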
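The results of Notes 8 and 9 may be confirmed numerically without any calculus. The sketch below integrates by the simple midpoint rule and compares the answers with $\sqrt{\pi}$ and with values of the gamma function taken from Python's standard `math.gamma`. The helper names, the finite range from $-8$ to $8$ standing in for the infinite one, and the particular values $m=2.5$, $n=3.5$ are assumptions chosen for illustration.

```python
import math

def midpoint_integral(func, a, b, n=100_000):
    """Approximate the integral of func from a to b with n midpoint strips."""
    h = (b - a) / n
    return h * sum(func(a + (i + 0.5) * h) for i in range(n))

# Note 8: the integral of exp(-x^2); the tails beyond |x| = 8 are negligible.
gauss = midpoint_integral(lambda x: math.exp(-x * x), -8.0, 8.0)
print(gauss, math.sqrt(math.pi))

# Note 9 (1): Gamma(n+1) = n * Gamma(n), here with n = 3.5.
print(math.gamma(4.5), 3.5 * math.gamma(3.5))

# Note 9 (2): B(m, n) = Gamma(m) * Gamma(n) / Gamma(m + n).
def beta_numeric(m, n):
    """B(m, n) as the integral of k^(m-1) * (1-k)^(n-1) from 0 to 1."""
    return midpoint_integral(lambda k: k ** (m - 1) * (1 - k) ** (n - 1), 0.0, 1.0)

m, n = 2.5, 3.5
print(beta_numeric(m, n), math.gamma(m) * math.gamma(n) / math.gamma(m + n))
```

Each line prints a pair of numbers which should agree to many decimal places.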
10. Elementary Method of Testing the Probability Integral Table.
The reader may find more satisfaction in using the probability
integral table if he tests for himself one or two of its results by
means of squared paper or in some other way.

We have seen that the probability of an error between 0 and $\xi$
is given by the expression

$\frac{1}{\sqrt{2\pi}}\int_0^{\xi}e^{-\xi^2/2}d\xi.$

Put $\xi=\sqrt{2}\,x$, and this becomes

$\frac{1}{\sqrt{\pi}}\int_0^{x}e^{-x^2}dx$

= area OBPN/area A'BA, in the figure.

Now the graph of $y=e^{-x^2}$ is drawn in fig. (40) of the text, and it
is possible therefore to get an approximation to the above
result for any value of $x$ by counting the number of small
squares in that figure enclosed by the areas corresponding to
OBPN and A'BA respectively.

Each complete small square may be reckoned as 1, and each
portion of a square may be reckoned as 1 if it exceeds half a square
and as zero if it is less than half a square.

This gives, for example,

$\frac{1}{\sqrt{\pi}}\int_0^{0.25}e^{-x^2}dx=98/707=0.139$,

whereas the tables give 0.138.

For a value like $x=0.71$, count the squares in the usual way
between curve, axes, and ordinate $x=0.70$; then add to the result
one-fifth of the number of squares in the small slice of area between
curve, axis, and ordinates $x=0.70$ and $x=0.75$. We get

$\frac{1}{\sqrt{\pi}}\int_0^{0.71}e^{-x^2}dx=240/707=0.339$

as compared with 0.342 from the tables.

These results are not unsatisfactory considering the rough nature
of the method followed to obtain them.
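The counting of squares can be imitated by machine, which gives some idea of how far the rounding rule may be trusted. The sketch below lays a grid of squares of side 0.05 under the curve $y=e^{-x^2}$, reckons each square as 1 or 0 according as more or less than half of it lies under the curve, and compares the resulting ratio with the exact value $\tfrac{1}{2}\,\mathrm{erf}(x)$ from Python's standard `math.erf`. The side of 0.05, the cut-off of the curve at $x=4$, the upper limit $x=0.25$, and the function name are all assumptions made for this illustration; the actual ruling of fig. (40) is not known from the text.

```python
import math

def count_squares(n_cols, h=0.05, samples=200):
    """Count grid squares of side h lying under y = exp(-x^2) between x = 0
    and x = n_cols*h; a square counts as 1 if more than half of it is covered."""
    total = 0
    n_rows = int(round(1.0 / h))          # the curve never rises above y = 1
    for i in range(n_cols):
        for j in range(n_rows):
            covered = 0.0
            for s in range(samples):
                x = (i + (s + 0.5) / samples) * h
                # Fraction of this square's height under the curve at x.
                covered += min(max((math.exp(-x * x) - j * h) / h, 0.0), 1.0)
            if covered / samples > 0.5:
                total += 1
    return total

part = count_squares(5)             # area OBPN for x from 0 to 0.25
whole = 2 * count_squares(80)       # area A'BA, both halves, out to |x| = 4
print(part, whole, part / whole)
print("exact:", 0.5 * math.erf(0.25), " and for x = 0.71:", 0.5 * math.erf(0.71))
```

The ratio so found should lie close to the exact figure, which to three places is the 0.138 of the tables; the exact value for $x=0.71$ likewise rounds to the tabulated 0.342.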
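11. The Law of Frequency in the case of two Correlated
Variables with certain Deductions therefrom [based on Professor
Karl Pearson's memoir, Regression, Heredity and Panmixia (Phil.
Trans., vol. 187 A, pp. 253-318)].

Consider two variables whose deviations, $x$ and $y$, from their
respective means are due to a number of independent causes, the
deviations in which from their means can be quantitatively denoted
by $\epsilon_1, \epsilon_2, \ldots \epsilon_n$.

We assume that each $\epsilon$ deviation is so small compared to the
mean value from which it is measured that $x$ and $y$ can be sensibly
expressed as linear functions, thus

$x=a_1\epsilon_1+a_2\epsilon_2+\ldots+a_n\epsilon_n$ . . . (1)

$y=b_1\epsilon_1+b_2\epsilon_2+\ldots+b_n\epsilon_n$ . . . (2)

(Some of the $a$'s and $b$'s may be zero, and if $x$ only involved, say,
$\epsilon_1, \epsilon_2 \ldots \epsilon_k$, and $y$ only involved $\epsilon_{k+1} \ldots \epsilon_n$, then it would be
natural to expect no correlation between $x$ and $y$.)

We further assume that each $\epsilon$ varies according to the normal
law with S.D. $\sigma$ with appropriate suffix.

Equations (1) and (2) show that the same $x$ and $y$ may arise in a
multitude of different ways obtained by varying the $\epsilon$'s so that
their weighted sums (the $a$'s and $b$'s being the weights) remain
unaltered. The probability that the particular deviations lying
between

$\epsilon_1$ and $(\epsilon_1+\delta\epsilon_1)$, $\epsilon_2$ and $(\epsilon_2+\delta\epsilon_2)$, . . .

shall concur, since they are all independent, is

$z=\left(\frac{1}{\sigma_1\sqrt{2\pi}}e^{-\epsilon_1^2/2\sigma_1^2}\delta\epsilon_1\right)\left(\frac{1}{\sigma_2\sqrt{2\pi}}e^{-\epsilon_2^2/2\sigma_2^2}\delta\epsilon_2\right)\ldots$

But, writing

$\alpha=a_3\epsilon_3+\ldots+a_n\epsilon_n,\quad \beta=b_3\epsilon_3+\ldots+b_n\epsilon_n,$

equations (1) and (2) become

$a_1\epsilon_1+a_2\epsilon_2+(\alpha-x)=0,$

$b_1\epsilon_1+b_2\epsilon_2+(\beta-y)=0.$

Therefore

$\frac{\epsilon_1}{a_2(\beta-y)-b_2(\alpha-x)}=\frac{\epsilon_2}{b_1(\alpha-x)-a_1(\beta-y)}=\frac{1}{a_1b_2-a_2b_1}.$

And, for any function $v$,

$\iint v\,\delta\epsilon_1\,\delta\epsilon_2=\iint v\,\frac{\partial(\epsilon_1,\epsilon_2)}{\partial(x,y)}\,\delta x\,\delta y=\iint v\,\frac{\delta x\,\delta y}{a_1b_2-a_2b_1}.$

Hence

$z=\frac{\delta x\,\delta y}{(a_1b_2-a_2b_1)\,\sigma_1\sigma_2\ldots\sigma_n(2\pi)^{n/2}}\;e^{-\left(\frac{\epsilon_3^2}{2\sigma_3^2}+\ldots+\frac{\epsilon_n^2}{2\sigma_n^2}\right)}\;e^{-\frac{[a_2(\beta-y)-b_2(\alpha-x)]^2}{2\sigma_1^2(a_1b_2-a_2b_1)^2}-\frac{[b_1(\alpha-x)-a_1(\beta-y)]^2}{2\sigma_2^2(a_1b_2-a_2b_1)^2}}\;\delta\epsilon_3\,\delta\epsilon_4\ldots$

The total probability for deviations between $x$ and $(x+\delta x)$ and
$y$ and $(y+\delta y)$ is obtained by integrating $z$ between limits $-\infty$ and $+\infty$
for all the $\epsilon$'s from $\epsilon_3$ to $\epsilon_n$, and it is not very difficult to see that
this will ultimately lead to an expression of the form

$z=C\,\delta x\,\delta y\,e^{-(ax^2+2hxy+by^2)}.$

This is the required law of frequency.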

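To find the meanings of the constants $a$, $b$, $h$. The total probability
for a deviation between $x$ and $(x+\delta x)$ associated with any deviation $y$ is

$=C\,\delta x\int_{-\infty}^{+\infty}e^{-(ax^2+2hxy+by^2)}dy$

$=C\,\delta x\,e^{-(ab-h^2)x^2/b}\int_{-\infty}^{+\infty}e^{-b(y+hx/b)^2}dy$

$=C\sqrt{\pi/b}\;\delta x\,e^{-(ab-h^2)x^2/b}.$

But if $x$ be subject to the normal law, the probability for a
deviation between $x$ and $(x+\delta x)$ is

$\frac{1}{\sqrt{2\pi}\,\sigma_x}e^{-x^2/2\sigma_x^2}\,\delta x,$

where $\sigma_x$ is the S.D. of $x$ independent of $y$.

Comparing these two results, we have

$1/2\sigma_x^2=(ab-h^2)/b=a(1-r^2)$,
if $r=-h/\sqrt{ab}$.

Similarly, $1/2\sigma_y^2=(ab-h^2)/a=b(1-r^2)$,

so that $h=-r\sqrt{ab}=-r/2\sigma_x\sigma_y(1-r^2)$.

Again, we may integrate $z$ for all values of $x$ and $y$, and so get
the total frequency, N, of the $(x, y)$ pair.

Thus, $N=C\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty}e^{-(ax^2+2hxy+by^2)}dx\,dy$

$=C\sqrt{\pi/b}\int_{-\infty}^{+\infty}e^{-(ab-h^2)x^2/b}dx$

$=C\sqrt{\pi/b}\,\sqrt{\pi}\,\sqrt{[b/(ab-h^2)]}.$

Hence

$C=\frac{N}{\pi}\sqrt{(ab-h^2)}=\frac{N}{\pi}\sqrt{[ab(1-r^2)]}=\frac{N}{2\pi\sigma_x\sigma_y\sqrt{(1-r^2)}}.$

Thus

$z=C\,e^{-\frac{1}{2(1-r^2)}\left[\frac{x^2}{\sigma_x^2}-\frac{2rxy}{\sigma_x\sigma_y}+\frac{y^2}{\sigma_y^2}\right]}\,\delta x\,\delta y,$

where C has the above value.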

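It still remains to interpret $r$ and to see that it is really the
coefficient of correlation as defined in Chapter x. For this purpose
let us suppose we have observed $n$ pairs of associated $x$'s and $y$'s,
namely

$(x_1y_1), (x_2y_2) \ldots (x_ny_n)$.

The probability for such a concurrence, taken along with a given
value for $r$ and assuming the observations independent, is
proportional to

$\frac{1}{\sqrt{(1-r^2)}}e^{-\frac{1}{2(1-r^2)}\left[\frac{x_1^2}{\sigma_x^2}-\frac{2rx_1y_1}{\sigma_x\sigma_y}+\frac{y_1^2}{\sigma_y^2}\right]}\times\ldots\times\frac{1}{\sqrt{(1-r^2)}}e^{-\frac{1}{2(1-r^2)}\left[\frac{x_n^2}{\sigma_x^2}-\frac{2rx_ny_n}{\sigma_x\sigma_y}+\frac{y_n^2}{\sigma_y^2}\right]}$

$=\frac{1}{(1-r^2)^{n/2}}e^{-\frac{1}{2(1-r^2)}\left[\frac{\Sigma x^2}{\sigma_x^2}-\frac{2r\Sigma xy}{\sigma_x\sigma_y}+\frac{\Sigma y^2}{\sigma_y^2}\right]}$

$=(1-r^2)^{-n/2}e^{-n(1-\kappa r)/(1-r^2)},$

where $\kappa=\Sigma xy/n\sigma_x\sigma_y$, since $\Sigma x^2=n\sigma_x^2$ and $\Sigma y^2=n\sigma_y^2$.

Now the probability of this particular distribution is greatest
when

$\tfrac{1}{2}\log(1-r^2)+\frac{1-\kappa r}{1-r^2}$

is least, and, differentiating with respect to $r$, this leads to

$-\frac{r}{1-r^2}+\frac{-\kappa(1-r^2)+2r(1-\kappa r)}{(1-r^2)^2}=0,$

i.e. $-r(1-r^2)-\kappa(1-r^2)+2r(1-\kappa r)=0$,

i.e. $-r+r^3-\kappa+\kappa r^2+2r-2\kappa r^2=0$,

i.e. $(r-\kappa)(1+r^2)=0$.

It is not difficult to show that $r=\kappa$ gives a minimum; hence the
required probability is a maximum and we get the best value for
the coefficient by taking

$r=\kappa=\Sigma xy/n\sigma_x\sigma_y$.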
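The final step, that the expression $\tfrac{1}{2}\log(1-r^2)+(1-\kappa r)/(1-r^2)$ is least when $r=\kappa$, may be checked without differentiation by trying a close series of values of $r$. The sketch below forms $\kappa=\Sigma xy/n\sigma_x\sigma_y$ for a few invented pairs, with deviations measured from the means, and searches for the $r$ which makes the expression least. The function names, the data, and the step of 0.001 in $r$ are assumptions for illustration.

```python
import math

def kappa_from_pairs(xs, ys):
    """kappa = sum(x*y) / (n * sigma_x * sigma_y), with x and y measured as
    deviations from their means and sigma the root-mean-square deviation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    dx = [x - mx for x in xs]
    dy = [y - my for y in ys]
    sigma_x = math.sqrt(sum(d * d for d in dx) / n)
    sigma_y = math.sqrt(sum(d * d for d in dy) / n)
    return sum(a * b for a, b in zip(dx, dy)) / (n * sigma_x * sigma_y)

def expression_to_minimise(r, kappa):
    """(1/2) log(1 - r^2) + (1 - kappa*r) / (1 - r^2)."""
    return 0.5 * math.log(1.0 - r * r) + (1.0 - kappa * r) / (1.0 - r * r)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]     # invented observations
ys = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]

kappa = kappa_from_pairs(xs, ys)
trial_r = [i / 1000.0 for i in range(-990, 991)]
best_r = min(trial_r, key=lambda r: expression_to_minimise(r, kappa))
print("kappa =", kappa, " r giving the least value =", best_r)
```

The value of $r$ found by trial should agree with $\kappa$ to within the step used.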
CERTAIN CURRENT SOURCES OF SOCIAL STATISTICS

Any one who is anxious to get reliable figures bearing upon some
social matter is somewhat at a pause unless he is thoroughly
conversant with all the statistical ramifications of Government
authorities, local and national, of trade unions, friendly societies, and
hosts of other bodies of a public or semi-public character.

While recognizing the lavish outpouring of statistics of all kinds
upon a multitude of diverse topics every year, and appreciating the
immense care and patience shown by those who are responsible for
their collection and preparation, one cannot but deplore the lack
of any co-ordinating principle in general between one body and
another either in deciding what statistics shall be collected, by
whom and when they shall be collected, or how afterwards they
shall be tabulated and presented to the public. Too often a
narrow-minded jealousy prevents one authority from consulting with
another, and such co-operation as does exist is due largely to the
efforts of able and enlightened individuals. The result is that a
vast amount of labour and expense goes to waste and the loss to the
public is incalculable, but the public do not care, and they do not
care because they do not know.

At present, to quote from an influential petition on the subject
recently presented to His Majesty's Government, ‘It is almost
universally the case that any serious investigation is reduced to
roughly approximate estimates in relation to some factor which is
essential for its result. ... It is not too much to say that there is
hardly any reform, financial, social, or commercial, for which adequate
information can be provided with our present machinery.’ But
this state of things would be partly remedied by adequate control
such as might be secured by the establishment of a central
statistical office with a minister in charge who should be responsible for
unification so far as possible in the collection, tabulation, and issue
of all public statistics.

It is scarcely possible for a single private individual to make
a quantitative investigation of any social question on a large enough
scale to produce results of real value; conspicuous instances like
Booth and Rowntree might seem to be exceptions to this rule, but
even they had a number of workers acting under their direction,
without whose aid their task would have seemed almost hopeless.
For such statistics as we have we are therefore dependent upon
Government departments, local authorities, public officials, trade
associations representing employers or labour, public companies,
and so on. The reader who wishes to get some idea of the extent
and the limitations of official British statistics is referred to the
admirable introductory chapters of Bowley's Elements of Statistics.
Here we cannot do more than mention a very few of the most
important sources whence such statistics are derived.

The most voluminous of all our records is probably the Census
of the Population which is taken every ten years. Its scope is but
faintly realized by enumerating the chief subjects on which the
Registrar-General obtained information in 1921 by means of
questions submitted to each householder:

(1) Numbers and Geographical Distribution of the Population. 

(2) Nationality and Birth-place. 

(3) Numbers at Different Ages, Male and Female. 

(4) Numbers Single, Married, Widowed, and Divorced. 

(5) Dependency and Orphanhood. 


(6) Sizes of Families in relation to Housing.

(7) Numbers engaged in different Industries and Occupations. 

(8) Number of Employers, Employees, and Independent Workers. 

(9) Workplaces in relation to Places of Residence. 


This may seem an ambitious scheme when it is stated that the
mere enumeration of the people was successfully opposed less than
two hundred years ago as ‘subversive of the last remains of English
liberty and likely to result in some public misfortune or an
epidemical disorder,’ and the first census was only taken in 1801. [See
Article in the Encyclopaedia Britannica on the subject.]

The results of each census are published in bulky volumes as
soon as they can be reduced and tabulated, a process which, of
course, takes a considerable time even for an army of workers
with calculating machines and every modern device to facilitate
their progress. It is to be regretted that more is not done to
advertise so valuable a record of work by publication in a cheap
and attractive form of a summary of matters which vitally affect
the good of the commonwealth. As it is, the census volumes tend
to be purchased only by public authorities and officials who require
to use them occasionally as books of reference.

Neglect of the blandishments of advertisement, to be commended
in general because such neglect is somehow associated with the
presentation of all truth, may be perhaps carried too far in the
issue of statistics.
It will be noted that in the periodical census no mention is made
of wages though the people are classified as regards occupation,
and for information upon this point we must turn to another source.
The last general census of wages was taken in 1906, following
and improving upon an earlier inquiry twenty years before, but,
in connection with an inquiry by the Board of Trade into the cost
of living of the working classes, information was collected as to
rates of wages in 1912 of workpeople in certain occupations in the
building, engineering, and printing trades, these being selected as
industries common to most towns, and because the time rates of
wages paid in them are largely standardized.

The 1906 inquiry into earnings and hours of labour, unlike the
decennial census, was conducted on a voluntary basis and was
never wholly completed. In brief it set out to discover from
employers:

(1) The Numbers of Working-people Employed in Various 
Occupations, distinguishing Men, Women, Lads, and Girls. 

(2) The Nature of the Work done and the Rates of Wages Paid,
distinguishing Time Rates from Piece Rates.

(3) The Hours Worked, distinguishing Under- or Over-time from
Normal Time.

The ground actually covered by the inquiry embraces the
following trades: Textiles, Clothing, Building and Woodworking, Public
Utility Services, Metal, Engineering, and Shipbuilding (in 1906);
also Agriculture, and Railway Service (in 1907); the reports upon
these trades were published separately at different dates between
1909 and 1912, and the following trades were bulked together in
one volume, published in 1913: Paper and Printing; Pottery,
Brick, Glass, and Chemicals; Food, Drink, and Tobacco; and
Miscellaneous Trades.

The Cost of Living Inquiry of 1912 was in continuation of a
similar inquiry in 1905, which in addition compared conditions in
the United Kingdom and certain foreign countries. It dealt not
only with wages but also with rents and retail prices.

The report states that ‘particulars as to the rent and accommodation 
of typical working-class dwellings were obtained from 
officials of local authorities, surveyors of taxes, house owners and 
agents, and by house-to-house inquiry.’ Also ‘returns of the 
prices most generally paid by working-class customers for a number 
of specified commodities were obtained in each town by personal 
inquiry from a number of retailers engaged in working-class trade.’ 

Since then Lord Sumner's Committee and a Committee of the 
Agricultural Wages Board have examined the change in the cost of 
living between 1914 and 1919, as evidenced by a number of household 
budgets collected from among urban working-classes and 
workers in rural districts respectively. 

One other highly important inquiry carried out by the Board of 
Trade deserves notice, namely, the First Census of Production of the 
United Kingdom (1907).¹ 

The published report shows :— 

(1) The total Net Output in Money Value for each Trade Group 
in each Industry. 

(2) The Number of Persons Employed in each Trade Group 
(salaried persons and wage-earners exclusive of outworkers). 

(3) The Net Output per Person Employed in each Trade Group 
as deduced from (1) and (2). 

(4) The Horse-power of Engines in Mines, Quarries, or Factories 
Employed in each Trade Group. 

It is explained that the term ‘ net output ’ here represents the 
value of the aggregate output of the factories, etc., from which 
returns were received in each trade group, after deducting the cost 
of materials purchased from factories, etc., not included in the 
group, or supplied by merchants or others not making returns to 
the Census of Production Office. 
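The definition of ‘net output’ just given, and the net output per person deduced from heads (1) and (2) of the report, can be put as a short sketch. The function names and every figure below are hypothetical illustrations and are not taken from the Census.

```python
def net_output(gross_output, materials_cost):
    """Value of aggregate output less the cost of materials bought outside the group."""
    return gross_output - materials_cost


def net_output_per_person(gross_output, materials_cost, persons_employed):
    """Head (3) of the report: head (1) divided by head (2)."""
    return net_output(gross_output, materials_cost) / persons_employed


# Hypothetical trade group (money values in pounds); not Census figures.
gross = 1_500_000.0
materials = 900_000.0
persons = 6_000

print(net_output(gross, materials))
print(net_output_per_person(gross, materials, persons))
```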

Valuable as the results of these inquiries undoubtedly are, they 
would be of still more value were it only possible satisfactorily to 
collate the various returns of population, wages, and production. 
No record of wages was included, for example, in the Census of 
Production statistics, and it is quite impossible to deduce the number 
of wage-earners and those dependent upon them in any trade at 
any given time. 

Apart, however, from such special inquiries as we have instanced, 
and the ten-yearly census of the people, there are other periodical 
records issued which provide us with valuable information. The 
Ministry of Labour, until recently a special branch of the Board 
of Trade, charged with the duty of keeping in touch with labour 
conditions, issues each month a Labour Gazette giving particulars 
relating to the state of employment in the principal trades in the 
United Kingdom based on returns from employers, trade unions, 
and employment exchanges, besides information concerning trade 
disputes, changes in wages and hours, the course of prices, railway 
traffic receipts, foreign trade, etc. The Board of Trade also publishes 
weekly a Journal and Commercial Gazette dealing with matters 
of interest to all who are engaged in commerce or finance; while a 
Monthly Bulletin of Statistics of production, trade, finance, employment, 
etc., at present issued under the name of the Supreme 
Economic Council, is an important recent addition to our knowledge 
of international statistics. 

¹ Now, just twenty years later, the results of another Census of Production 
are in process of publication. 

Again the Registrar-General makes a quarterly return and annual 
summary of births, marriages, and deaths in the different counties 
of England and Wales, and of births, deaths, and infectious diseases 
in certain large towns. In each public health area the medical officer 
reports periodically upon the hygienic condition of the district and 
the health of the people under his care. The Board of Education 
is answerable for conditions in the schools, and the Home Office 
in factories and prisons ; they report from time to time. The 
Ministry of Health similarly issues returns relating to pauperism 
and to housing, while the Board of Agriculture and Fisheries registers 
the acreage under crops and the number of live stock in the United 
Kingdom, and the Commissioners of Customs record the expansion 
or contraction of foreign trade. 

In addition we have the endless accounts and statistics supplied, 
some voluntarily and some compulsorily, by municipal bodies, 
public companies, banks, trade associations, co-operative societies, 
insurance companies, trade unions, etc. 

And yet, in spite of all this wealth of statistics, some surprising 
gaps occur, as we have already seen, in important particulars 
which cannot be traced. We shall quote only one more instance 
of such a hiatus: the income-tax returns provide a basis for measuring 
that part of the national income which is subject to taxation, 
some idea also can be formed of what the wage-earners receive, 
but as to the earnings of the portion of the community falling in 
between these two classes we are entirely ignorant. It is possible 
that war conditions during the years 1914-19 may have vastly 
increased the knowledge of the Government as to some matters 
such as internal resources and inland trade, of which little was 
known before, but, if so, the public, whom it concerns so closely, 
have not yet been permitted fully to share in this advantage. 

Excellent periodical summaries of Government statistics are to 
be found in the Abstract of Labour Statistics and in the Statistical 
Abstract for the United Kingdom. Also, a most useful Guide to 
Official Statistics is now issued each year by the Stationery Office, 
and Dr. Bowley’s Official Statistics will repay careful study in conjunction 
with it. 
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A NOTE ON TABLES TO AID CALCULATION 

The short tables which follow are only inserted as specimens, as 
it is expected that the reader who wishes to make extensive use 
of such tables will have access to the fuller ones to which reference 
is made below. 



Fig. (55). 

Probability Integral Table, giving area of curve $z=\frac{1}{\sqrt{2\pi}}e^{-x^{2}/2}$ in 
terms of corresponding abscissa, see fig. (55):— 

   x      ½(1+α)      α            x      ½(1+α)      α
  .00     .50000    .00000        .76     .77637    .55274
  .10     .53983    .07966        .77     .77935    .55870
  .20     .57926    .15852        .78     .78230    .56460
  .30     .61791    .23582        .79     .78524    .57048
  .40     .65542    .31084        .80     .78814    .57628
  .45     .67364    .34728        .85     .80234    .60468
  .50     .69146    .38292        .90     .81594    .63188
  .55     .70884    .41768        .95     .82894    .65788
  .60     .72575    .45150       1.00     .84134    .68268
  .65     .74215    .48430       1.05     .85314    .70628
  .70     .75804    .51608       1.10     .86433    .72866
  .71     .76115    .52230       1.50     .93319    .86638
  .72     .76424    .52848       2.00     .97725    .95450
  .73     .76730    .53460       2.50     .99379    .98758
  .74     .77035    .54070       3.00     .99865    .99730
  .75     .77337    .54674       3.50     .99977    .99954



Fig. (56), the result of plotting α against x, enables us to estimate 
the probability of an error lying between any two limits. 
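Where no printed table is at hand, the two tabulated quantities can be computed directly. The sketch below assumes the standard relation between the normal curve and the error function; the function names are illustrative and do not come from the text. Here ½(1+α) is the area to the left of the abscissa x, and α the area between −x and +x.

```python
import math


def half_one_plus_alpha(x):
    """Area under the standard normal curve to the left of abscissa x."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def alpha(x):
    """Area under the standard normal curve between -x and +x."""
    return math.erf(x / math.sqrt(2.0))


def prob_between(lower, upper):
    """Probability of an error lying between two limits (in standard units)."""
    return half_one_plus_alpha(upper) - half_one_plus_alpha(lower)


# Print a few rows in the layout of the table: x, 1/2(1+alpha), alpha.
for x in (0.50, 1.00, 2.00, 3.00):
    print(f"{x:4.2f}  {half_one_plus_alpha(x):.5f}  {alpha(x):.5f}")
```

The last figure may differ by a unit from the printed table, whose α column appears to have been obtained by doubling the rounded area.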
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Table giving P, to test ‘ goodness of fit,’ corresponding to certain 
values of n' and χ² :— 

  χ²     n'=7     n'=8     n'=9     n'=10    n'=11    n'=12    n'=13    n'=14    n'=15
   4    .67668   .77978   .85712   .91141   .94735   .96992   .98344   .99119   .99547
   5    .54381   .65996   .75758   .83431   .89118   .93117   .95798   .97519   .98581
   6    .42319   .53975   .64723   .73992   .81526   .87336   .91608   .94615   .96649
   7    .32085   .42888   .53663   .63712   .72544   .79908   .85761   .90215   .93471
   8    .23810   .33259   .43347   .53415   .62884   .71330   .78513   .84360   .88933
   9    .17358   .25266   .34230   .43727   .53210   .62189   .70293   .77294   .83105
  10    .12465   .18857   .26503   .35048   .44049   .53039   .61596   .69393   .76218
  11    .08838   .13862   .20170   .27571   .35752   .44326   .52892   .61082   .68604
  12    .06197   .10056   .15120   .21331   .28506   .36364   .44568   .52764   .60630
  13    .04304   .07211   .11185   .16261   .22367   .29333   .36904   .44781   .52652
  14    .02964   .05118   .08177   .12233   .17299   .23299   .30071   .37384   .44971
  15    .02026   .03600   .05915   .09094   .13206   .18250   .24144   .30735   .37815

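Entries of the P table can be checked by direct calculation. The following is a sketch only, on the assumption (borne out by the tabulated figures) that P is the upper-tail probability of χ² with n'−1 degrees of freedom. It uses the elementary closed forms for even and odd degrees of freedom; the function name is illustrative.

```python
import math


def goodness_of_fit_p(chi2, n_groups):
    """P for a given chi-squared and n' groups, taking n' - 1 degrees of freedom."""
    df = n_groups - 1
    half = chi2 / 2.0
    if df % 2 == 0:
        # Even df: exp(-chi2/2) * sum of (chi2/2)^k / k! for k = 0 .. df/2 - 1
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: twice the normal tail beyond chi, plus a finite series
    # chi + chi^3/3 + chi^5/(3*5) + ... with (df - 1)/2 terms.
    chi = math.sqrt(chi2)
    tail = math.erfc(chi / math.sqrt(2.0))
    term, total = chi, 0.0
    for k in range(1, (df - 1) // 2 + 1):
        total += term
        term *= chi2 / (2 * k + 1)
    return tail + math.sqrt(2.0 / math.pi) * math.exp(-half) * total


# Compare with a few tabulated entries.
for chi2, n in ((4, 7), (5, 8), (15, 15)):
    print(chi2, n, round(goodness_of_fit_p(chi2, n), 5))
```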


One of the earliest tables of the probability integral appeared in 
Kramp’s Analyse des Réfractions (Strasbourg, 1798), where the 
calculation of $\int_x^\infty e^{-t^2}\,dt$ was given to eight places from x=0 to x=3 
at intervals of 0.01. Tables more recent and extensive are those 
due to J. Burgess (Trans. Roy. Soc. Edin., 1900) and to W. F. 
Sheppard (Biometrika, vol. ii., pp. 174-190). Of these the latter 
is reproduced in the admirable Tables for Statisticians and Biometricians, 
edited by Karl Pearson (Camb. Univ. Press, 1914), and 
the same volume also contains Palin Elderton’s P Tables for testing 
‘goodness of fit’ which first appeared in Biometrika, vol. i., and 
Duffell’s Tables of the Logarithms of the Γ Function from Biometrika, 
vol. vii., besides a large number of other valuable tables. 

It should be remarked in connection with the last-named table that 
the formula $\Gamma(x+1)=x\,\Gamma(x)$ enables us to reduce the calculation 
of any Γ function to one in which x lies between 1 and 2, by repeated 
applications of the logarithmic relation, thus 

$$\log\Gamma(x+1)=\log x+\log\Gamma(x)=\log x+\log(x-1)+\log\Gamma(x-1),$$

and so on. When x is large, however, say greater than 10, the 
well-known approximate formula 

$$\Gamma(x+1)=e^{-x}\,x^{x+\frac{1}{2}}\sqrt{2\pi}\left(1+\frac{1}{12x}\right)$$

(see, for instance, Whittaker’s Analysis, § 110) will be found useful, 
and it may also be written 

$$\log_{10}\frac{\Gamma(x+1)}{x^{x+\frac{1}{2}}e^{-x}}=0.3990899+\frac{0.0361912}{x},$$

a form often convenient. 
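Both devices, the reduction by Γ(x+1)=xΓ(x) and the approximate formula for large x, can be tried numerically. This is a minimal sketch: `math.gamma` stands in for the printed table over the range 1 to 2, the constants are recomputed rather than copied, and the function names are assumptions made for illustration.

```python
import math

LOG10_E = math.log10(math.e)                        # 0.4342945...
LOG10_SQRT_2PI = 0.5 * math.log10(2.0 * math.pi)    # 0.3990899...


def log10_gamma_by_reduction(z):
    """log10 Gamma(z) for z >= 1, reducing the argument to the range 1..2."""
    total = 0.0
    while z > 2.0:
        z -= 1.0
        total += math.log10(z)          # log Gamma(z + 1) = log z + log Gamma(z)
    return total + math.log10(math.gamma(z))   # stand-in for a table look-up


def log10_gamma_x_plus_1_approx(x):
    """Approximate log10 Gamma(x + 1) for large x, in the logarithmic form above."""
    return ((x + 0.5) * math.log10(x) - x * LOG10_E
            + LOG10_SQRT_2PI + (LOG10_E / 12.0) / x)


x = 10.0
print(log10_gamma_by_reduction(x + 1.0))
print(log10_gamma_x_plus_1_approx(x))
print(math.log10(math.factorial(10)))   # Gamma(11) = 10!
```

At x = 10 the approximation should agree with the exact logarithm to about five decimal places, improving as x grows.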

It may be of service to record here the values of a few constants 
which frequently recur for speedy reference : 

$e = 2.718\,2818$ 

$1/e = 0.367\,8794$ 

$\log_{10} e = 0.434\,2945$ 

$\log_{10}(\log_{10} e) = \bar{1}.637\,7843$ 

$\pi = 3.141\,5926$ 

$\log_{10}\pi = 0.497\,1499$ 

$\log_{10}\frac{1}{\sqrt{2\pi}} = \bar{1}.600\,9101$ 

$\log_{10} 2 = 0.301\,0300$ 

$\log_{10} 3 = 0.477\,1213$ 

The statistician who has Pearson’s Tables, Barlow’s Tables of 
Squares, etc., together with a good set of Tables of Logarithms 
(unless he is so fortunate as to have a mechanical calculator, for 
instance a Brunsviga, at his disposal) and of Trigonometrical 
Functions such as Chambers’s Seven-Figure Tables, may consider 
himself amply provided for serious research and decidedly better 
off than his predecessors who prepared the way for him by doing 
great work with much poorer tools. 


MISCELLANEOUS EXAMPLES 

[Selected from London B.Sc. {Econ.) Pass and Honours papers] 

PART I 

(1) Define the genus ‘average,’ and the principal species of that 
genus. Adduce concrete cases in which (a) the Arithmetic Mean, or 
(b) the Median, is specially appropriate. 

(2) Supposing that statistics of rents of working-class dwellings 
have been collected in a certain district for a series of years, describe 
some way of forming an index number showing the changes in rents 
from year to year during the period. Give reasons for the process 
you adopt, or state any advantages it appears to you to possess. 

(3) Measure by whatever method you think most suitable the 
correlation between the two following series, and show graphically the 
relationship between the two series. 



 Year    Exports      Unemployment      Year    Exports      Unemployment
         per head.    Index.                    per head.    Index.
            £                                      £
 1884      6.6           8.1            1899      6.5           2.2
 1885      5.9           9.3            1900      7.1           2.5
 1886      6.9          10.2            1901      6.7           3.3
 1887      6.1           7.6            1902      6.8           4.0
 1888      6.4           4.6            1903      6.9           4.7
 1889      …             2.1            1904      7.1           6.0
 1890      7.0           2.1            1905      7.7           …
 1891      6.6           3.6            1906      8.7           3.0
 1892      6.0           6.3            1907      9.7           3.7
 1893      6.7           7.5            1908      8.6           7.8
 1894      6.6           6.9            1909      8.6           7.7
 1895      6.8           5.8            1910      9.6           4.7
 1896      6.1           …              1911     10.0           …
 1897      6.9           3.5            1912     10.7           3.2
 1898      6.8           2.9            1913     11.4           2.1



(4) Apply some test by which the figures in the previous table can 
be used to determine whether unemployment (as there measured) 
increased or diminished in the 30 years. 

(5) Exhibit the difficulty of comparing nations, in respect of power 
and prosperity, by means of statistics relating to (a) the number of 
population, (b) occupations, (c) criminality, (d) exports and imports. 

(6) Draw up, with careful attention to form and detail and showing 
all sub-totals, a blank table in which could be shown, for the years 
1919 to 1923 inclusive, the numbers of students who entered for the 
Final Examination for B.Sc. (Econ.), distinguishing Internal and 
External Students, Pass and Honours Candidates, and the results of 
the examinations (Pass or Fail in the case of Pass Candidates and 
Honours I, II, III, Pass Degree, Fail in the case of Honours 
Candidates). 


(7) Define the geometric mean, and discuss its use in forming index 
numbers of prices. 

(8) The average prices of wheat and the quantities sold at four 
markets are given as follows :— 

 Market.    Average Price per Qr.    Quantity sold, Qrs.
    A            27s. 3d.                  …
    B            28s. 8d.                  …
    C            29s. 1d.                16,000
    D            27s. 2d.                12,000

Find the mean price for the four markets, weighting each local 
average with the quantity sold. 

Would it be possible for the average price at each of the above 
markets to rise from one year to the next and yet for the weighted 
mean price to fall ? If so, under what conditions ? 

(9) Illustrate the necessity for standardisation when heterogeneous 
groups are in question by describing the methods of computing 
standard birth- or death-rates or family food-consumption. 

(10) The following are the Annual Premiums required to secure 
at death £1000 plus a Guaranteed Reversionary Bonus of £2 per cent. 
on the sum assured under the Whole Life Policies of a certain 
Assurance Company :— 


 Age next       Annual Premium.
 Birthday.        £   s.   d.
    25           24   12    6
    30           27   14    2
    35           31   11    8
    40           36    7    6
    45           42    6    8

Find by any method of interpolation what the premium would be 
at age 36 next birthday. 
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(11) Explain why the method of measuring the mortality from any 
disease by the proportion of deaths from that disease to deaths from 
all causes is essentially fallacious. 

Criticise the following mode of argument in a recent blue-book, 
containing anthropometric data with respect to school children : 
The gradation in weight from the poorest group up to the wealthiest 
is one of the most striking features of the tables. If we take all the 
children of ages from 5 to 18, we find that the average weight of 
the boy from a one-roomed tenement is 52.6 lb. ; of the boy from a 
two-roomed tenement, 56.1 lb. ; of the boy from a three-roomed 
tenement, 60.6 lb. ; of the boy from a tenement of four rooms or more, 
64.3 lb. 

(12) Show how to measure the ‘ trend ’ and the ‘ fluctuation ’ of a 
series of numbers relating to economic phenomena, such as trade or 
employment. 

(13) Find the average age and the median age of the married men 
included in the table below, and calculate one measure of dispersion. 


                     Married Men.                         Widowers.
 Ages.       Number of Men    Average Number of       Number of Men
             000's.           Children under 16.      000's.
 Under 20         1                 .47                    ..
 20-             34                 .61                    ..
 25-             99                 .97                     1
 30-            132                1.50                     2
 35-            139                1.99                     3
 40-            138                1.98                     6
 45-            130                1.63                     7
 50-            104                 .95                     9
 55-             78                 .48                    11
 60-             53                 .20                    13
 65-             33                 .09                    13.5
 70-             16                 .06                    10.5
 75-              6                 .04                     7
 80-              2                 .04                     4



(14) What do you understand by a weighted average ? Estimate 
the average number of children of married men of all ages from the 
data in the above table. 

(15) Estimate the number of married men between the ages 
62 and 63 in the same table, and also estimate at what age the 
average number of children is a maximum. Illustrate each estimate 
by a diagram. 

(16) Define frequency group and standard deviation. Show that, 
if $\mu'_2$ is the second moment of any frequency group about any 
origin and $\mu_2$ the second moment about the average of the group 
and $\bar{x}$ is the average measured from the origin, then $\mu_2=\mu'_2-\bar{x}^2$. 
Calculate the standard deviation of the ages of widowers shown in the 
table of question (13). 

(17) State the product-sum formula for the correlation coefficient, 
and prove that if the means of rows and of columns of the correlation 
table lie on two straight lines the equations to these lines are 

$$y=r\frac{\sigma_y}{\sigma_x}\,x,\qquad x=r\frac{\sigma_x}{\sigma_y}\,y$$

respectively, x and y being deviations of the variables from their 
arithmetic means, r the correlation coefficient, and $\sigma_x$, $\sigma_y$ the 
standard deviations. 

(18) Below are given the populations of the County of London and 
the four surrounding administrative counties at the Censuses of 1891 
and of 1901 :— 

 County.              Population.
                    1891.          1901.
 London           4,228,317      4,536,541
 Essex              578,471        816,640
 Middlesex          542,894        792,314
 Surrey             419,115        519,654
 Kent               807,328        936,240
 Total            6,576,125      7,601,389



(a) Assuming a constant percentage-rate of increase in each 
administrative county, estimate its population in 1896 at a date 
midway between the two Censuses. 

(b) Assuming a constant percentage-rate of increase for the area as 
a whole (London and the four surrounding counties), estimate the 
total population at the same date in 1896. 

Why does your estimate (b) differ from the sum of the estimates 
under (a) ? 

(19) Give as exact a definition as possible of the term ‘ Cost of 
Living.’ How far can the change in the Cost of Living be measured 
over a period in which there have been considerable modifications of 
diet or other changes in consumption of necessary commodities ? 

(20) Discuss the methods of presenting wage statistics by averages. 
Illustrate by a diagram the following data :— 

Building Trades.—Men. Full time earnings. Median, 37s.; 
Quartiles, 29s. 6d., 40s. 6d. ; 5.9 per cent. received less than 20s. ; 
2.8 per cent. received 46s. or more. 

Estimate the average wage roughly. 
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(21) Construct a diagram to show graphically the relationship 
between yield of corn and rainfall from the data in the table below. 


 Years.   Yield per acre     Rainfall in        Years.   Yield per acre     Rainfall in
          of corn, in        July, in inches.            of corn, in        July, in inches.
          bushels.                                       bushels.
 1886        24.5               …               1896        40.5              6.17
 1887        19.2              2.40             1897        32.5              3.59
 1888        35.7              3.83             1898        30.0              2.84
 1889        32.3              4.45             1899        36.0              3.42
 1890        26.2              2.03             1900        37.0              4.15
 1891        33.5              1.88             1901        21.4              2.63
 1892        26.2              3.71             1902        38.7              4.78
 1893        25.7              2.20             1903        32.2              3.41
 1894        28.8              1.58             1904        36.5              5.23
 1895        37.4              6.01             1905        39.8              4.78



(22) Find the correlation between yield of corn and rainfall in the 
above table. 


(23) Define (a) arithmetic average, (b) geometric average, (c) 
median, (d) mode, (e) quartile. 

Instance cases when (b), (c) and (d) are specially appropriate. 

(24) Comment on the form of grouping adopted in (26) Table I., 
and state any inconveniences that it presents. 

Calculate approximately the values of the median and quartiles, 
using a graphic method. 


(25) Explain what is meant by the skewness of a frequency distribution. 
Give Pearson’s measure of the skewness, and any other 
way of measuring it. 

Obtain some measure of the skewness of the distribution 
(26) Table II. 



(26) Table I., showing the number of civil parishes in England and 
Wales in which the population at the Census of 1901 lay between the 
limits given in the column on the left:— 


 Population.               Number of           Population.                Number of
                           Civil Parishes.                                Civil Parishes.
 None                           26             500 and under 750            1,657
 1 and under 50                812             750     „     1,000            842
 50     „     100            1,339             1,000   „     5,000          2,411
 100    „     200            2,603             5,000   „     10,000           413
 200    „     300            2,036             10,000  „     20,000           241
 300    „     400            1,410             20,000 and upwards             273
 400    „     500            1,038
 Total No. of Civil Parishes                                               14,900

Table II., showing the number of rooms measured, in a certain 
investigation, in which the size lay between the limits given in the 
column on the left; area calculated to the nearest square foot:— 

 Area of Room in         Number.        Area of Room in         Number.
 Square Feet.                           Square Feet.
 20 and under 40             3          200 and under 220          18
 40     „     60            14          220     „     240          12
 60     „     80            16          240     „     260           3
 80     „     100           36          260     „     280           2
 100    „     120           31          280     „     300           2
 120    „     140           35          300     „     320           2
 140    „     160           35          320     „     340          ..
 160    „     180           15          340     „     360          ..
 180    „     200           26          360     „     380           1
 Total No. of Rooms Measured                                        …



Write a short account of the use of graphic methods in statistics. 
Draw diagrams representing the data of Table I. and of Table II. 


(27) Define the standard deviation, and show that the mean square 
deviation is least when deviations are measured from the arithmetic 
mean. 

Find the mean and standard deviation for the sizes of the rooms 
given in (26) Table II. 

(28) What corrections are applied to the crude death-rates of 
areas in order to obtain comparable rates ? 

(29) (1) Estimated average weekly wages of agricultural labourers 
in thirty-six counties of England in 1891. and (2) the percentage of 
the population in receipt of poor law relief, in rural unions of the 
same counties, on 1st January of the same year :— 


County. 

Wagoe. 

PereenUce 
in Hceoi|)t 
of Relief. 

County. 

W»ge», 

Perwij 
in Receipt 
of Relief. 

County. 

Wagee. 

Percentice 

III Receipt 
of Kelief. 

1 

8. 

18 

d. 

6 

1*7 

13 

e. 

14 

d. 

8 

3-6 

25 


d. 

0 

4-9 

2 

18 

0 

2-3 

14 

14 

0 

31 

26 

IQ 

0 

4-7 

3 

17 

0 

2-5 

15 

14 

0 

40 

27 

IQ 

0 

89 

4 

17 

0 

2-1 

16 

14 

0 

2-3 

28 

ni 

6 

40 

5 

16 

0 

30 

17 

13 

0 

2-8 

29 

11 

6 

4-5 

6 

16 

0 

2-1 

18 

12 

0 


30 

n 

6 

4-2 

7 

15 

6 

2-8 

19 

12 

0 


31 

11 

0 

6-2 

8 

15 

6 

2-7 

20 

12 

0 


32 

11 

0 

4-2 

9 

15 

0 

3-6 

21 

12 

0 


33 

10 

6 

4-2 

10 

15 

0 

3-1 

22 

12 

0 


34 

10 

0 

3-2 

11 

; 15 

0 

31 

23 

12 

0 

4-9 

35 

10 

0 

4-4 

12 

15 

0 

2-7 

24 

12 

0 

4-2 

36 

10 

0 

4-8 


















Define the arithmetic mean, the median and the mode, and give a 
sketch of a skew frequency distribution showing the approximate 
position of each. 

State the chief advantages of the arithmetic mean as a form of 
average, and find the arithmetic mean and the median for the wages 
of agricultural labourers in the above table. 

(30) Explain clearly the meaning of the term ‘ dispersion,’ and 
find the mean deviation from the median for wages in the same 
table. 


(31) Also, define standard deviation, and find the standard 
deviation of these wages. 

(32) Using the data in question (29), test graphically, with squared 
paper, the correlation between average wages and percentages of the 
population in receipt of poor law relief, stating your conclusions in 
words. 


(33) Construct a blank table, complete with headings and lines, and 
with due regard to spacing, in which could be inserted the numbers 
of persons employed in six groups of industries, four grades of age at 
three different periods. 


(34) The following table gives for 780 weeks the call discount rate 
and the ratio of reserves to deposits in New York. Calculate the 
average discount rate for the various ratios of reserves to deposits, 
and express the results graphically. 


Call Discount Rates. 



1- 

8- 

s- 1 

4- 



7- 

8- 1 9- 10 ^ 

12 



1 26 

TgU)a. 

5 

m 

8. 

iS 

s 

I 

1 

o 

.s 

1 

21V.- 

2:57.- 

2oV.- 

27"/.- 

2ia- 

31*/.- 

ss*/.- 

35V.- 

37V.-' 

30*/.- 

41V,- 

43-/.- 

45V.- 

6 

25 

47 

30 

22 

18 

30 

20 

2 

2 

^ • s 

72 

87 

20 

Ci 

2| 

2 

99 9 

999 

9 9 9 

... 1 

1 

9 99 

4 9 • 

9 

57 

31 

11 

1 

99 • 

999 1 

♦ 9 1 

j :::::: 

1 rfv- :::::: 


I 

» « 1 

9 

7 

2 

4 »9 

9 9 « 

9 » 9 

9 # 9 

9 » 9 

9 

99 9 

9 • 9 

3 

6 

1 1 

9 9 9 

9 • • 

99 9 

<99 

99 9 

• • 9 

1 

1 

14 * 

3 

99 9 

♦ 99 

♦ « 9 

• 99 

9 < 9 

99 • 

• 99 

9 • 9 

9 • • 

9 •♦ 

1 • • 

2 

4 

1 

999 

99 • 

9 99 

999 

< 9 ^ 

99 9 

• 9 9 

99 • 

• • 9 

1 

1 

• « » 

2 

1 

• s • 

• « • 

9 « 9 

< « 9 

• • s 

9 9 9 

• < 9 

4 • 9 

1 

9 9 9 

9 9 9 

1 

9 19 * 

9 99 

• 9 9 

• 99 

99 9 

9 9 9 

9 9 9 

9 9 9 

• 99 

1 

2 

9 9 9 

1 1 

• 99 , 

99 9 

4 9 • 

• 99 

• 99 

9 9 9 

• 9 9 

99 # 

3 

10 

127 

2:i9 

102 

89 

40 

24 

20 

30 

20 

2 

2 

ToUli, . 

214 

195 

109 

97 

72 

45 

18 

10 

4 

7 

1 

3 

1 

4 

m 


The heading 15- covers all rates of 15 and over but less than 20, 
etc. 


From the following data find the equation of the regression line 
giving the average discount rate for all ratios of reserves to deposits, 
and plot the line on the same diagram. Is the use of the product 
moment method of determining correlation justifiable in this case ? 



                                     Means.    Standard Deviations.    Correlation.
 Call Discount Rates                   3.0            2.5
                                                                          −.62
 Ratios of Reserves to Deposits       30.3            4.2



(35) As an illustration of the nature of definitions in statistics 
explain fully the meaning of the statement: ‘The total value of 
exports (produce and manufactures of the United Kingdom) in 1918 
was £498,473,066.’ 
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PART II 

(1) Give a short account of the chief official publications, in 
England, relating to statistics of one of the following subjects, with 
especial reference to the source and the precise meaning of the data :— 

(a) Vital statistics (births, deaths and marriages). 

(b) Foreign trade. 

(c) Agriculture. 

(2) What do you understand by the words ‘ frequency group ’ ? 

England and Wales, 1911 

 Ages                            10-    15-    20-    25-    35-    45-    55-    65-
 All Occupied (Males 000s.)      216   1164   1146   2225   1815   1262    723    299
 Coal Hewers (00s.)               63    338    538   1067    798    467    212     …

Compute suitable averages and measures of dispersion for the 
comparison of the age groups in this table and comment on the results. 

(3) What means are available for testing the significance of 
differences between statistical coefficients ? Test whether the 
differences between the means and measures of dispersion for the 
two series given in the previous question are significant. 

(4) $s_1=4.63$ and $s_2=3.71$ are the standard deviations of two groups, 
$x_1, x_2, \ldots, x_n$, and $y_1, y_2, \ldots, y_n$. $\Sigma xy=4832$. $n=1000$. 

Explain exactly the meaning of standard deviations. Calculate 
the product-sum coefficient of correlation between the groups, 
and state what it measures. Write down the probable error of the 
coefficient and explain its meaning. 

(5) Find the standard deviation of the differences between corresponding 
values of two variates x and y. 

(6) Set out in detail the method by which you would make graphic 
comparisons of two such series of figures as Imports of Manufactures 
and Unemployment. 

(7) If the recorded births in a certain district may be in defect by 
x per cent., and the estimated population in error by ±y per cent., 
find an approximate expression for the greatest possible error in the 
birth-rate, x and y being assumed fairly small (say, not more than 
6 per cent. or so). 

(8) Given five thousand different figures, e.g. quotations of prices, 
or measurements of human statures, how would you (a) select five 
hundred figures at random from that total, and (b) ascertain the 
probability that the average of the five hundred selected figures does 
not differ from the average of the five thousand by more than any 
assigned extent ? 
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( 9 ) («) 


Number of Person* 

% 


Number of Rconi' 

' per Tenement. 


Total. 

i 

1 

Approi. 

Averej^e 

Numl>©r 

of 

Itoome. 

l>er Tenemrni. 

2 

1_ 

8 

1 4 

5 

4 

r 

1 ' 

8 


10 

'<>T luore. 

1 


9 

' 8 

5 

1 1 

4 

1 

1 

1 

i 

55 

3*5 

<> 

14 

21 


: 4^» 

29 

I 9 

8 

3 

4 

191 

4-8 

3 


21 

i 

57 

34 

* n 

8 

3 

5 

20S 

5-0 

4 

4 


57 

44 

27 

10 

3 

2 

10 

179 

! 5-2 

5 

3 

IS 

{ 3^; 

42 

21 

1 I 

5 

5 

12 

148 

5*3 

6 

2 

8 

i 21 

43 

19 

4 1 

5 

3 

7 

112 

5*4 

7 


5 


14 

1 11 

5 

3 

2 

4 

02 

6-6 

8 


1 

1 14 

1 

11 1 

1 

1 10 

1 

» * • 

2 

2 

4 

1 

44 

5-7 

Totals, . . 1 

54 

105 ( 

I 

275 

* * \ 

25(1 

l.'-C’ 

55 

35 

21 

47 

1000 

6-1 

Averag** Number 
of Persona, 

207 

3-56 

394 

4-24 

1 

4-24 

3-78 

4*14 I 

402 

4*81 

1 

3*976 


Standard deviations : persons 1.83, rooms 1.9 (approx.). 

(b) Show that the coefficient of correlation can be expressed in the 
form 

where $\bar{x}$, $\bar{y}$ are the averages of the observations referred to any origin. 

(c) Calculate, by any method, a measurement of the correlation 
between the number of rooms and number of persons per tenement 
shown in (a). 

(d) Calculate the third and fourth moments of the frequency 
curve of persons ; determine the position of the mode and also 
determine the skewness by any method known to you. 

(10) The table given below gives the results of the measurement of 
series of 959 Oxford Students and of 2348 convicts. Find what, if 
any, differences between the statistical constants given are significant 
and comment on the results. 


 Character.       Data.        Means.     Standard       Coefficients of Correlation
                                          Deviations.    with Stature.
 Head length      Students     196.05       6.23             .31
                  Convicts     192.14       6.39             .26
 Head breadth     Students     152.84       4.92             .14
                  Convicts     151.02       5.49             .15
 Head height      Students     136.62       5.80             .28
                  Convicts     132.29       5.21             .19
 Stature          Students      60.49       2.60
                  Convicts      65.44       2.65

(11) Give a short account of the nature of the information contained 
in one of the following : Census of Production, 1907 ; Reports 
on Wages in 1906 (the ‘Wage Census’) ; Reports on Buildings and 
Tenements (housing and overcrowding) in the Population Census, 
1911. 


(12) Outline a method by which the normal curve of error can be 
obtained as the limit of $(p+q)^n$. 

(13) Discuss (a) the best means of obtaining accurate statistics of 
family expenditure, and (b) the best means of combining such data so 
as to form a representative type. 

(14) Define the following terms and give illustrations of their use : 
interpolation, standard deviation, moment, skewness, logarithmic 
scale, geometric mean, partial correlation, normal curve of error. 

(15) In m trials an event has happened r times. How would you 
determine the probability that this result is consistent with the 
hypothesis of random sampling from a universe in which the chance 
of the event happening is a certain small quantity p ? Why cannot 
the required probability be derived from a table of the normal curve 
of error ? 


(16) If $m_1$, $m_2$ are the numbers of deaths occurring in a year among 
$N_1$, $N_2$ persons of two different occupations, the standard deviation 
by which the significance of the observed difference in the death rates 
per 1000 can be tested is given as 


Show how this formula is obtained and criticise it. 


(17) Contrast the methods used in the construction of any two 
current index numbers of wholesale prices. Under what conditions 
is ‘ weighting ’ important in index numbers ? 

(18) Analyse in some detail the cases in which it may be assumed
(1) that a frequency distribution is normal, or (2) that the
probability of errors in a measurement or observation exceeding various
amounts is determinable by the normal table of probability.

(19) What methods are available for testing the ‘ goodness of fit ’
of a mathematical curve to observations ?
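
One familiar test of the kind asked for in question (19) sums (observed − expected)²/expected over the groups. The sketch below computes that sum only; the frequencies are hypothetical, and the resulting value would still have to be referred to a table of the test for the appropriate number of groups.

```python
def chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected over all groups."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical frequencies: five groups, each expected to hold 20.
observed = [18, 22, 20, 25, 15]
expected = [20, 20, 20, 20, 20]
print(chi_square(observed, expected))
```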

(20) If z = x₁ + x₂ + . . . + xₙ, where the x’s are deviations from
the average of quantities selected at random and independently of
each other from a curve of frequency whose standard deviation is σ,
show that the standard deviation of z is σ√n, n being finite.

Under what circumstances can it be shown that the curve of
frequency of z is normal ?
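
For independent deviations the variances add, so the sum of n of them has standard deviation σ√n, and their average has standard deviation σ/√n. The sketch below checks the first statement exactly for a small skew distribution, which is hypothetical and chosen only so that its average is zero.

```python
from fractions import Fraction
from itertools import product

def variance(dist):
    """Variance of a discrete distribution given as {value: probability}."""
    mean = sum(v * p for v, p in dist.items())
    return sum((v - mean) ** 2 * p for v, p in dist.items())

def sum_distribution(dist, n):
    """Exact distribution of the sum of n independent selections from dist."""
    total = {0: Fraction(1)}
    for _ in range(n):
        nxt = {}
        for (s, ps), (v, pv) in product(total.items(), dist.items()):
            nxt[s + v] = nxt.get(s + v, 0) + ps * pv
        total = nxt
    return total

# Hypothetical deviations with average zero and variance 3/2.
x = {-1: Fraction(1, 2), 0: Fraction(1, 4), 2: Fraction(1, 4)}
n = 5
print(variance(x), variance(sum_distribution(x, n)))  # second is n times the first
```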

(21) What methods are available for classifying frequency curves
into types ? State briefly the mathematical concepts underlying the
classification of frequency curves.
(22) Explain how the necessity for Sheppard’s corrections of
moments arises.

If is the second moment calculated from observations when all
are supposed to be grouped at the middle of grades whose breadth is
h, show that second moment if it is assumed that the
observations are evenly distributed through the grades.
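
As an arithmetical check on the second part of question (22), the sketch below takes a small grouped distribution and computes the second moment twice: once with every observation placed at the middle of its grade, and once with the observations of each grade spread evenly over the breadth h. On that assumption the two differ by exactly h²/12. The frequencies, midpoints and function names are hypothetical, and the sketch illustrates only the even-spreading assumption stated in the question, not Sheppard's own argument.

```python
from fractions import Fraction

def second_moment_grouped(midpoints, freqs):
    """Second moment about the origin, all observations at grade midpoints."""
    n = sum(freqs)
    return sum(f * Fraction(c) ** 2 for c, f in zip(midpoints, freqs)) / n

def second_moment_spread(midpoints, freqs, h):
    """Second moment about the origin, each grade spread evenly over breadth h."""
    h = Fraction(h)
    n = sum(freqs)
    total = Fraction(0)
    for c, f in zip(midpoints, freqs):
        lo, hi = Fraction(c) - h / 2, Fraction(c) + h / 2
        total += f * (hi ** 3 - lo ** 3) / (3 * h)  # mean of x^2 over the grade
    return total / n

# Hypothetical symmetrical grouping with grades of breadth 2.
midpoints = [-4, -2, 0, 2, 4]
freqs = [1, 4, 6, 4, 1]
h = 2
grouped = second_moment_grouped(midpoints, freqs)
spread = second_moment_spread(midpoints, freqs, h)
print(grouped, spread, spread - grouped)  # the difference is h*h/12
```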

(23) A sample containing 1000 is drawn at random from a large
universe and 300 are found to possess a certain attribute. Can you
infer anything as to the proportion in the universe that have this
attribute, or what further information is needed ?
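
For the figures of question (23), a rough answer rests on the standard deviation of a sampled proportion, sqrt(pq/n). The sketch below assumes simple random sampling and uses the observed proportion in place of the unknown one; the range of three standard deviations is a conventional choice made here, not something fixed by the question.

```python
import math

def proportion_standard_error(successes, n):
    """Standard deviation of a proportion under simple random sampling."""
    p = successes / n
    return math.sqrt(p * (1 - p) / n)

p = 300 / 1000
se = proportion_standard_error(300, 1000)
print(se)                      # about 0.0145
print(p - 3 * se, p + 3 * se)  # rough limits for the proportion in the universe
```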

(24) Discuss the effect on a weighted mean of errors in the
quantities or the weights.
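
A small numerical case may help with question (24). In the sketch below the same 10 per cent error is put first into one weight and then into the corresponding quantity; all the figures are hypothetical. Where the quantities do not differ greatly among themselves, the error in the weight moves the mean far less than the error in the quantity.

```python
def weighted_mean(quantities, weights):
    """Weighted arithmetic mean."""
    return sum(q * w for q, w in zip(quantities, weights)) / sum(weights)

# Hypothetical quantities and weights.
quantities = [100.0, 105.0, 110.0]
weights = [1.0, 2.0, 3.0]
base = weighted_mean(quantities, weights)

# A 10 per cent error in the last weight, then in the last quantity.
shift_from_weight_error = weighted_mean(quantities, [1.0, 2.0, 3.3]) - base
shift_from_quantity_error = weighted_mean([100.0, 105.0, 121.0], weights) - base
print(shift_from_weight_error, shift_from_quantity_error)
```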

(25) Write a brief note on the assumptions made in calculating the
probable error of a statistical quantity, such as the standard deviation
or the correlation coefficient.
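
The usual large-sample expressions touched on in question (25) are 0.6745 σ/√(2n) for the probable error of a standard deviation and 0.6745 (1 − r²)/√n for that of a correlation coefficient. Both take the sample to be large and drawn at random from a normal distribution. The sketch below simply evaluates them; the value r = 0.4 is hypothetical, while n = 1126 and σ = 1.66 are borrowed from the chest-girth table for illustration.

```python
import math

PROBABLE_ERROR_FACTOR = 0.6745  # probable error = 0.6745 x standard error

def probable_error_of_sd(sd, n):
    """Probable error of a standard deviation (normal, large-sample form)."""
    return PROBABLE_ERROR_FACTOR * sd / math.sqrt(2 * n)

def probable_error_of_r(r, n):
    """Probable error of a correlation coefficient (normal, large-sample form)."""
    return PROBABLE_ERROR_FACTOR * (1 - r * r) / math.sqrt(n)

print(probable_error_of_sd(1.66, 1126))
print(probable_error_of_r(0.4, 1126))  # r = 0.4 is a hypothetical value
```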


(26) Calculate the average, second, third and fourth moments,
mean deviation, standard deviation and skewness of the frequency
groups of chest girths shown in the following table:

HEIGHT AND CHEST GIRTH OF 1126 RECRUITS OF 18 YEARS OF AGE.

Totals for each grade of chest girth:

Chest girth in inches   28   29   30   31   32   33   34   35   36   37   38   39   Total
Number of recruits       1    2    9   47  159  207  280  219  127   49   22    4    1126

Arrays for each height. The frequencies of each array are given in
order of increasing chest girth, blank cells being omitted.

Height in   Frequencies in the array             Total   Average    Standard deviation
inches                                                   of array   of array
60          1, 2, 8, 6, 6                           23     32.7       1.75
61          1, 9, 18, 17, 10, 6, 4, 1, 2            68     33.5       1.87
62          1, 1, 3, 8, 24, 21, 15, 15, 3           91     33.5       1.57
63          3, 29, 30, 43, 25, 6, 3, 1             140     34.2       1.34
64          1, 4, 30, 42, 52, 33, 11, 1, 1         180     34.1       1.33
65          6, 12, 22, 43, 32, 22, 4, 1            143     34.7       1.50
66          1, 7, 16, 30, 21, 29, 18, 12, 4        144     34.7
67          3, 6, 17, 40, 21, 18, 6, 3             121     35.0       1.40
68          2, 2, 8, 8, 28, 18, 18, 8, 4, 1         97     35.1       1.77
69          2, 2, 9, 12, 10, 19, 6, 4, 1            71     35.5       1.67
70          1, 1, 14, 6, 6                          30     36.3       1.3
71          1, 3, 3, 2, 2, 1, 2                     14     35.4       1.8
72          1, 1, 1, 1                               4     34.5       2.2
Totals                                            1126     34.51      1.66

Average height, 65.6 ; standard deviation, 2.52.

Compare the relations between the quantities calculated with those
that are found in the normal curve of error.
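
The arithmetic of question (26) can be sketched as follows from the totals for each grade of chest girth (1, 2, 9, 47, 159, 207, 280, 219, 127, 49, 22, 4 for 28 to 39 inches). Two assumptions are made here: each grade is taken to run from the stated inch to the next, so that its midpoint is half an inch higher, and skewness is measured by the third moment divided by the cube of the standard deviation, which may not be the measure the examiner intended. No Sheppard's correction is applied.

```python
import math

lower_limits = list(range(28, 40))                    # inches, 28 to 39
freqs = [1, 2, 9, 47, 159, 207, 280, 219, 127, 49, 22, 4]
midpoints = [x + 0.5 for x in lower_limits]           # assumed grade midpoints

n = sum(freqs)
mean = sum(f * x for f, x in zip(freqs, midpoints)) / n

def central_moment(k):
    """k-th moment of the grouped chest girths about their average."""
    return sum(f * (x - mean) ** k for f, x in zip(freqs, midpoints)) / n

mu2, mu3, mu4 = central_moment(2), central_moment(3), central_moment(4)
sd = math.sqrt(mu2)
mean_deviation = sum(f * abs(x - mean) for f, x in zip(freqs, midpoints)) / n
skewness = mu3 / sd ** 3                              # assumed measure of skewness

print(n, round(mean, 2), round(sd, 2))                # 1126 34.51 1.66
print(mu2, mu3, mu4, mean_deviation, skewness)
# In the normal curve mu3 = 0, mu4 = 3 * mu2^2, and the mean deviation
# is about 0.7979 of the standard deviation.
print(mu4 / mu2 ** 2, mean_deviation / sd)
```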

(27) From the above data draw the regression line (chest girth on
height), and with the help of your drawing find an approximate value
of the coefficient of correlation between height and chest girth.
In normal distributions the standard deviation of an array is
σ₁√(1 − r²), where σ₁ is the standard deviation of the arrays merged in
one group. Are the standard deviations shown in the table consistent
with this formula ?
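
In place of a drawing, the regression line of question (27) can be approximated by fitting a straight line to the averages of the arrays, each weighted by the number in the array. The sketch below uses the array totals and averages from the table, takes the midpoint of each height grade to be half an inch above the stated inch (an assumption), and uses the standard deviations 2.52 and 1.66 quoted with the table. The slope multiplied by the ratio of the standard deviations gives an approximate r, from which σ₁√(1 − r²) follows.

```python
import math

heights = [h + 0.5 for h in range(60, 73)]            # assumed grade midpoints
totals = [23, 68, 91, 140, 180, 143, 144, 121, 97, 71, 30, 14, 4]
averages = [32.7, 33.5, 33.5, 34.2, 34.1, 34.7, 34.7,
            35.0, 35.1, 35.5, 36.3, 35.4, 34.5]       # average chest girth of each array

n = sum(totals)
h_bar = sum(w * h for w, h in zip(totals, heights)) / n
y_bar = sum(w * y for w, y in zip(totals, averages)) / n

# Weighted least-squares slope of array average on height.
s_hy = sum(w * (h - h_bar) * (y - y_bar) for w, h, y in zip(totals, heights, averages))
s_hh = sum(w * (h - h_bar) ** 2 for w, h in zip(totals, heights))
slope = s_hy / s_hh

sd_height, sd_chest = 2.52, 1.66                      # as quoted with the table
r = slope * sd_height / sd_chest
array_sd = sd_chest * math.sqrt(1 - r * r)

print(round(slope, 3), round(r, 2), round(array_sd, 2))
```

The value of `array_sd` can then be set against the standard deviations of the separate arrays given in the table.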